Commit

Merge branch 'main' of github.com:huggingface/transformers into add-deci-lm
ArthurZucker committed Dec 16, 2023
2 parents 400d129 + 238d2e3 commit 36071af
Showing 222 changed files with 14,619 additions and 1,281 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/self-push-amd-mi210-caller.yml
Original file line number Diff line number Diff line change
@@ -18,7 +18,7 @@ on:
jobs:
run_amd_ci:
name: AMD mi210
-if: (cancelled() != true) && ((github.event_name == 'push') && (github.ref_name == 'main' || startsWith(github.ref_name, 'run_amd_push_ci_caller')))
+if: (cancelled() != true) && ((github.event_name == 'workflow_run') || ((github.event_name == 'push') && startsWith(github.ref_name, 'run_amd_push_ci_caller')))
uses: ./.github/workflows/self-push-amd.yml
with:
gpu_flavor: mi210
2 changes: 1 addition & 1 deletion .github/workflows/self-push-amd-mi250-caller.yml
@@ -18,7 +18,7 @@ on:
jobs:
run_amd_ci:
name: AMD mi250
-if: (cancelled() != true) && ((github.event_name == 'push') && (github.ref_name == 'main' || startsWith(github.ref_name, 'run_amd_push_ci_caller')))
+if: (cancelled() != true) && ((github.event_name == 'workflow_run') || ((github.event_name == 'push') && startsWith(github.ref_name, 'run_amd_push_ci_caller')))
uses: ./.github/workflows/self-push-amd.yml
with:
gpu_flavor: mi250
17 changes: 9 additions & 8 deletions README.md

Large diffs are not rendered by default.

17 changes: 9 additions & 8 deletions README_es.md

Large diffs are not rendered by default.

17 changes: 9 additions & 8 deletions README_hd.md

Large diffs are not rendered by default.

17 changes: 9 additions & 8 deletions README_ja.md

Large diffs are not rendered by default.

17 changes: 9 additions & 8 deletions README_ko.md

Large diffs are not rendered by default.

17 changes: 9 additions & 8 deletions README_zh-hans.md

Large diffs are not rendered by default.

17 changes: 9 additions & 8 deletions README_zh-hant.md

Large diffs are not rendered by default.

6 changes: 5 additions & 1 deletion docs/source/en/_toctree.yml
@@ -135,6 +135,8 @@
title: Overview
- local: quantization
title: Quantization
- local: trainer
title: Trainer
- sections:
- local: perf_train_gpu_one
title: Methods and tools for efficient training on a single GPU
@@ -149,7 +151,7 @@
- local: perf_train_tpu_tf
title: Training on TPU with TensorFlow
- local: perf_train_special
-title: Training on Specialized Hardware
+title: PyTorch training on Apple silicon
- local: perf_hardware
title: Custom hardware for training
- local: hpo_train
@@ -743,6 +745,8 @@
title: TVP
- local: model_doc/vilt
title: ViLT
- local: model_doc/vipllava
title: VipLlava
- local: model_doc/vision-encoder-decoder
title: Vision Encoder Decoder Models
- local: model_doc/vision-text-dual-encoder
16 changes: 7 additions & 9 deletions docs/source/en/glossary.md
@@ -100,7 +100,7 @@ reading the whole sentence but using a mask inside the model to hide the future

### channel

-Color images are made up of some combination of values in three channels - red, green, and blue (RGB) - and grayscale images only have one channel. In 🤗 Transformers, the channel can be the first or last dimension of an image's tensor: [`n_channels`, `height`, `width`] or [`height`, `width`, `n_channels`].
+Color images are made up of some combination of values in three channels: red, green, and blue (RGB) and grayscale images only have one channel. In 🤗 Transformers, the channel can be the first or last dimension of an image's tensor: [`n_channels`, `height`, `width`] or [`height`, `width`, `n_channels`].
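The two layouts named in this glossary entry can be illustrated with a quick sketch (numpy stands in for the actual image tensor; the 32x32 size is illustrative):

```python
import numpy as np

# A 32x32 RGB image in channels-first layout: [n_channels, height, width]
channels_first = np.zeros((3, 32, 32))

# The same image moved to channels-last layout: [height, width, n_channels]
channels_last = channels_first.transpose(1, 2, 0)

print(channels_first.shape)  # (3, 32, 32)
print(channels_last.shape)   # (32, 32, 3)
```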

### connectionist temporal classification (CTC)

@@ -116,6 +116,7 @@ A type of layer in a neural network where the input matrix is multiplied element

Parallelism technique for training on multiple GPUs where the same setup is replicated multiple times, with each instance
receiving a distinct data slice. The processing is done in parallel and all setups are synchronized at the end of each training step.

Learn more about how DataParallel works [here](perf_train_gpu_many#dataparallel-vs-distributeddataparallel).
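The replicate-slice-synchronize step described above can be sketched in plain Python. This is a toy model with a single scalar weight and two hypothetical replicas, not the framework implementation (e.g. `torch.nn.DataParallel` does this internally):

```python
# Toy data parallelism: replicate one scalar "model", give each replica a
# distinct slice of the batch, then average the per-replica gradients so
# every replica applies the same update at the end of the step.
weight = 2.0
batch = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]  # (x, target) pairs

def replica_gradient(weight, shard):
    # d/dw of the mean squared error of y = w * x over this shard
    return sum(2 * (weight * x - t) * x for x, t in shard) / len(shard)

# Two replicas, each receiving a distinct data slice
shards = [batch[:2], batch[2:]]
grads = [replica_gradient(weight, shard) for shard in shards]

# Synchronize: average the gradients and apply one shared update
avg_grad = sum(grads) / len(grads)
weight -= 0.1 * avg_grad
```

With equal-sized shards, the averaged gradient matches the gradient computed on the full batch, which is why the replicas stay in sync.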

### decoder input IDs
@@ -165,8 +166,7 @@ embeddings `[batch_size, sequence_length, config.intermediate_size]` can account
use. The authors of [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) noticed that since the
computation is independent of the `sequence_length` dimension, it is mathematically equivalent to compute the output
embeddings of both feed forward layers `[batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n`
-individually and concat them afterward to `[batch_size, sequence_length, config.hidden_size]` with `n =
-sequence_length`, which trades increased computation time against reduced memory use, but yields a mathematically
+individually and concat them afterward to `[batch_size, sequence_length, config.hidden_size]` with `n = sequence_length`, which trades increased computation time against reduced memory use, but yields a mathematically
**equivalent** result.
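The equivalence claimed above can be checked with a small numpy sketch. This is a toy two-layer feed forward chunked along the sequence dimension, illustrating the idea rather than the library's `apply_chunking_to_forward` itself; all shapes are made up:

```python
import numpy as np

batch_size, sequence_length, hidden_size, intermediate_size = 2, 4, 8, 32
rng = np.random.default_rng(0)
x = rng.normal(size=(batch_size, sequence_length, hidden_size))
w1 = rng.normal(size=(hidden_size, intermediate_size))
w2 = rng.normal(size=(intermediate_size, hidden_size))

def feed_forward(t):
    # linear -> ReLU -> linear
    return np.maximum(t @ w1, 0) @ w2

# Unchunked: materializes the full [batch_size, sequence_length,
# intermediate_size] activation at once.
full = feed_forward(x)

# Chunked with n = sequence_length: only one position's intermediate
# activation lives in memory at a time; outputs are concatenated afterward.
chunked = np.concatenate(
    [feed_forward(x[:, i : i + 1, :]) for i in range(sequence_length)], axis=1
)

assert np.allclose(full, chunked)  # mathematically equivalent
```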

For models employing the function [`apply_chunking_to_forward`], the `chunk_size` defines the number of output
@@ -187,7 +187,7 @@ The model head refers to the last layer of a neural network that accepts the raw

* [`GPT2ForSequenceClassification`] is a sequence classification head - a linear layer - on top of the base [`GPT2Model`].
* [`ViTForImageClassification`] is an image classification head - a linear layer on top of the final hidden state of the `CLS` token - on top of the base [`ViTModel`].
-* [`Wav2Vec2ForCTC`] ia a language modeling head with [CTC](#connectionist-temporal-classification-(CTC)) on top of the base [`Wav2Vec2Model`].
+* [`Wav2Vec2ForCTC`] is a language modeling head with [CTC](#connectionist-temporal-classification-(CTC)) on top of the base [`Wav2Vec2Model`].
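The pattern in these examples, a small head on top of a base model's raw hidden states, can be sketched with numpy. The pooling choice and every shape below are illustrative assumptions, not the actual transformers modules:

```python
import numpy as np

batch_size, sequence_length, hidden_size, num_labels = 2, 5, 8, 3
rng = np.random.default_rng(0)

# Stand-in for the raw hidden states produced by a base model
hidden_states = rng.normal(size=(batch_size, sequence_length, hidden_size))

# A sequence classification head is just a linear layer over one pooled state
head_weight = rng.normal(size=(hidden_size, num_labels))
pooled = hidden_states[:, -1, :]   # pool: take the last token's hidden state
logits = pooled @ head_weight      # projection to [batch_size, num_labels]

print(logits.shape)  # (2, 3)
```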

## I

@@ -232,9 +232,7 @@ is added for "RA" and "M":
['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
```

-These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding
-the sentence to the tokenizer, which leverages the Rust implementation of [🤗
-Tokenizers](https://github.com/huggingface/tokenizers) for peak performance.
+These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding the sentence to the tokenizer, which leverages the Rust implementation of [🤗 Tokenizers](https://github.com/huggingface/tokenizers) for peak performance.

```python
>>> inputs = tokenizer(sequence)
@@ -383,7 +381,7 @@ self-supervised objective, which can be reading the text and trying to predict t
modeling](#causal-language-modeling)) or masking some words and trying to predict them (see [masked language
modeling](#masked-language-modeling-mlm)).

Speech and vision models have their own pretraining objectives. For example, Wav2Vec2 is a speech model pretrained on a contrastive task which requires the model to identify the "true" speech representation from a set of "false" speech representations. On the other hand, BEiT is a vision model pretrained on a masked image modeling task which masks some of the image patches and requires the model to predict the masked patches (similar to the masked language modeling objective).

## R

@@ -518,7 +516,7 @@ A form of model training in which data provided to the model is not labeled. Uns

### Zero Redundancy Optimizer (ZeRO)

-Parallelism technique which performs sharding of the tensors somewhat similar to [TensorParallel](#tensorparallel--tp-),
+Parallelism technique which performs sharding of the tensors somewhat similar to [TensorParallel](#tensor-parallelism-tp),
except the whole tensor gets reconstructed in time for a forward or backward computation, therefore the model doesn't need
to be modified. This method also supports various offloading techniques to compensate for limited GPU memory.
Learn more about ZeRO [here](perf_train_gpu_many#zero-data-parallelism).
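The shard-then-reconstruct idea described in this entry can be sketched in plain Python. The parameter list and the `all_gather` helper are hypothetical stand-ins, not a real distributed implementation:

```python
# Toy ZeRO-style sharding: each worker stores only its shard of a parameter
# tensor; the full tensor is reconstructed (all-gathered) just in time for a
# forward or backward computation, so the model code itself is unchanged.
full_param = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
n_workers = 3
shard_size = len(full_param) // n_workers

# Each worker holds one contiguous shard, cutting per-worker memory
shards = [
    full_param[i * shard_size : (i + 1) * shard_size] for i in range(n_workers)
]

def all_gather(shards):
    # Reassemble the whole tensor from the per-worker shards
    return [value for shard in shards for value in shard]

reconstructed = all_gather(shards)
assert reconstructed == full_param  # the computation sees the full tensor
```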
1 change: 1 addition & 0 deletions docs/source/en/index.md
@@ -281,6 +281,7 @@ Flax), PyTorch, and/or TensorFlow.
| [VAN](model_doc/van) | βœ… | ❌ | ❌ |
| [VideoMAE](model_doc/videomae) | βœ… | ❌ | ❌ |
| [ViLT](model_doc/vilt) | βœ… | ❌ | ❌ |
| [VipLlava](model_doc/vipllava) | βœ… | ❌ | ❌ |
| [Vision Encoder decoder](model_doc/vision-encoder-decoder) | βœ… | βœ… | βœ… |
| [VisionTextDualEncoder](model_doc/vision-text-dual-encoder) | βœ… | βœ… | βœ… |
| [VisualBERT](model_doc/visual_bert) | βœ… | ❌ | ❌ |
