Skip to content

Commit

Permalink
Merge branch 'main' into doc-redesign
Browse files Browse the repository at this point in the history
  • Loading branch information
stevhliu authored Aug 6, 2024
2 parents d5c7e96 + 26a9443 commit 31b60e4
Show file tree
Hide file tree
Showing 201 changed files with 5,865 additions and 1,091 deletions.
2 changes: 1 addition & 1 deletion docker/consistency.dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ RUN pip install uv && uv venv && uv pip install --no-cache-dir -U pip setuptools
RUN uv pip install --no-cache-dir --upgrade 'torch' --index-url https://download.pytorch.org/whl/cpu
# tensorflow pin matching setup.py
RUN uv pip install --no-cache-dir "tensorflow-cpu<2.16" "tf-keras<2.16"
RUN uv pip install --no-cache-dir "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[flax,quality,speech,vision,testing]"
RUN uv pip install --no-cache-dir "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[flax,quality,torch-speech,vision,testing]"
RUN git lfs install

RUN pip uninstall -y transformers
Expand Down
6 changes: 6 additions & 0 deletions docs/source/en/_toctree.yml
Original file line number Diff line number Diff line change
Expand Up @@ -62,6 +62,8 @@
title: Generation with LLMs
- local: generation_strategies
title: Customize the generation strategy
- local: kv_cache
title: Best Practices for Generation with Cache
- local: llm_tutorial_optimization
title: Getting the most out of LLMs
- local: perplexity
Expand Down Expand Up @@ -451,6 +453,8 @@
title: MADLAD-400
- local: model_doc/mamba
title: Mamba
- local: model_doc/mamba2
title: mamba2
- local: model_doc/marian
title: MarianMT
- local: model_doc/markuplm
Expand Down Expand Up @@ -481,6 +485,8 @@
title: MT5
- local: model_doc/mvp
title: MVP
- local: model_doc/nemotron
title: Nemotron
- local: model_doc/nezha
title: NEZHA
- local: model_doc/nllb
Expand Down
111 changes: 0 additions & 111 deletions docs/source/en/generation_strategies.md
Original file line number Diff line number Diff line change
Expand Up @@ -174,117 +174,6 @@ An increasing sequence: one, two, three, four, five, six, seven, eight, nine, te
```


## KV Cache Quantization

The `generate()` method supports caching keys and values to enhance efficiency and avoid re-computations. However the key and value
cache can occupy a large portion of memory, becoming a bottleneck for long-context generation, especially for Large Language Models.
Quantizing the cache when using `generate()` can significantly reduce memory requirements at the cost of speed.

KV Cache quantization in `transformers` is largely inspired by the paper [KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache]
(https://arxiv.org/abs/2402.02750) and currently supports `quanto` and `HQQ` as backends. For more information on the inner workings see the paper.

To enable quantization of the key-value cache, one needs to indicate `cache_implementation="quantized"` in the `generation_config`.
Quantization related arguments should be passed to the `generation_config` either as a `dict` or an instance of a [`QuantizedCacheConfig`] class.
One has to indicate which quantization backend to use in the [`QuantizedCacheConfig`], the default is `quanto`.

<Tip warning={true}>

Cache quantization can be detrimental if the context length is short and there is enough GPU VRAM available to run without cache quantization.

</Tip>


```python
>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
>>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16).to("cuda:0")
>>> inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device)

>>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="quantized", cache_config={"nbits": 4, "backend": "quanto"})
>>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
I like rock music because it's loud and energetic. It's a great way to express myself and rel

>>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20)
>>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
I like rock music because it's loud and energetic. I like to listen to it when I'm feeling
```

## KV Cache Offloading

Similarly to KV cache quantization, this strategy aims to reduce GPU VRAM usage.
It does so by moving the KV cache for most layers to the CPU.
As the model's `forward()` method iterates over the layers, this strategy maintains the current layer cache on the GPU.
At the same time it asynchronously prefetches the next layer cache as well as sending the previous layer cache back to the CPU.
Unlike KV cache quantization, this strategy always produces the same result as the default KV cache implementation.
Thus, it can serve as a drop-in replacement or a fallback for it.

Depending on your model and the characteristics of your generation task (size of context, number of generated tokens, number of beams, etc.)
you may notice a small degradation in generation throughput compared to the default KV cache implementation.

To enable KV cache offloading, pass `cache_implementation="offloaded"` in the `generation_config`.

```python
>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> ckpt = "microsoft/Phi-3-mini-4k-instruct"

>>> tokenizer = AutoTokenizer.from_pretrained(ckpt)
>>> model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda:0")
>>> inputs = tokenizer("Fun fact: The shortest", return_tensors="pt").to(model.device)

>>> out = model.generate(**inputs, do_sample=False, max_new_tokens=23, cache_implementation="offloaded")
>>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
Fun fact: The shortest war in history was between Britain and Zanzibar on August 27, 1896.

>>> out = model.generate(**inputs, do_sample=False, max_new_tokens=23)
>>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
Fun fact: The shortest war in history was between Britain and Zanzibar on August 27, 1896.
```

<Tip warning={true}>

Cache offloading requires a GPU and can be slower than the default KV cache. Use it if you are getting CUDA out of memory errors.

</Tip>

The example below shows how KV cache offloading can be used as a fallback strategy.
```python
>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> def resilient_generate(model, *args, **kwargs):
... oom = False
... try:
... return model.generate(*args, **kwargs)
... except torch.cuda.OutOfMemoryError as e:
... print(e)
... print("retrying with cache_implementation='offloaded'")
... oom = True
... if oom:
... torch.cuda.empty_cache()
... kwargs["cache_implementation"] = "offloaded"
... return model.generate(*args, **kwargs)
...
...
>>> ckpt = "microsoft/Phi-3-mini-4k-instruct"
>>> tokenizer = AutoTokenizer.from_pretrained(ckpt)
>>> model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda:0")
>>> prompt = ["okay "*1000 + "Fun fact: The most"]
>>> inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
>>> beams = { "num_beams": 40, "num_beam_groups": 40, "num_return_sequences": 40, "diversity_penalty": 1.0, "max_new_tokens": 23, "early_stopping": True, }
>>> out = resilient_generate(model, **inputs, **beams)
>>> responses = tokenizer.batch_decode(out[:,-28:], skip_special_tokens=True)
```

On a GPU with 50 GB of RAM, running this code will print
```
CUDA out of memory. Tried to allocate 4.83 GiB. GPU
retrying with cache_implementation='offloaded'
```
before successfully generating 40 beams.


## Watermarking

The `generate()` supports watermarking the generated text by randomly marking a portion of tokens as "green".
Expand Down
2 changes: 2 additions & 0 deletions docs/source/en/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -188,6 +188,7 @@ Check the table below to see whether a model supports PyTorch, TensorFlow, or JA
| [M2M100](model_doc/m2m_100) ||||
| [MADLAD-400](model_doc/madlad-400) ||||
| [Mamba](model_doc/mamba) ||||
| [mamba2](model_doc/mamba2) ||||
| [Marian](model_doc/marian) ||||
| [MarkupLM](model_doc/markuplm) ||||
| [Mask2Former](model_doc/mask2former) ||||
Expand Down Expand Up @@ -216,6 +217,7 @@ Check the table below to see whether a model supports PyTorch, TensorFlow, or JA
| [MusicGen Melody](model_doc/musicgen_melody) ||||
| [MVP](model_doc/mvp) ||||
| [NAT](model_doc/nat) ||||
| [Nemotron](model_doc/nemotron) ||||
| [Nezha](model_doc/nezha) ||||
| [NLLB](model_doc/nllb) ||||
| [NLLB-MOE](model_doc/nllb-moe) ||||
Expand Down
18 changes: 18 additions & 0 deletions docs/source/en/internal/generation_utils.md
Original file line number Diff line number Diff line change
Expand Up @@ -386,18 +386,36 @@ A [`Constraint`] can be used to force the generation to include specific tokens
- get_seq_length
- reorder_cache

[[autodoc]] OffloadedCache
- update
- prefetch_layer
- evict_previous_layer

[[autodoc]] StaticCache
- update
- get_seq_length
- reset

[[autodoc]] HybridCache
- update
- reset

[[autodoc]] SlidingWindowCache
- update
- reset

[[autodoc]] EncoderDecoderCache
- get_seq_length
- to_legacy_cache
- from_legacy_cache
- reset
- reorder_cache

[[autodoc]] MambaCache
- update_conv_state
- update_ssm_state
- reset

## Watermark Utils

[[autodoc]] WatermarkDetector
Expand Down
Loading

0 comments on commit 31b60e4

Please sign in to comment.