diff --git a/docs/source/en/model_doc/dbrx.md b/docs/source/en/model_doc/dbrx.md
index 1192c22021f24a..33435462b3e024 100644
--- a/docs/source/en/model_doc/dbrx.md
+++ b/docs/source/en/model_doc/dbrx.md
@@ -36,14 +36,63 @@ This model was contributed by [eitan-turok](https://huggingface.co/eitanturok) a
 
 ## Usage Examples
 
-The `generate()` method can be used to generate text using DBRX.
+The `generate()` method can be used to generate text using DBRX. You can generate using the standard attention implementation, flash-attention, or PyTorch's scaled dot product attention. The last two attention implementations give speed-ups.
 
 ```python
 from transformers import DbrxForCausalLM, AutoTokenizer
 import torch
 
 tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-instruct", token="YOUR_HF_TOKEN")
-model = DbrxForCausalLM.from_pretrained("databricks/dbrx-instruct", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True, token="YOUR_HF_TOKEN")
+model = DbrxForCausalLM.from_pretrained(
+    "databricks/dbrx-instruct",
+    device_map="auto",
+    torch_dtype=torch.bfloat16,
+    token="YOUR_HF_TOKEN",
+    )
+
+input_text = "What does it take to build a great LLM?"
+messages = [{"role": "user", "content": input_text}]
+input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
+
+outputs = model.generate(**input_ids, max_new_tokens=200)
+print(tokenizer.decode(outputs[0]))
+```
+
+If you have flash-attention installed (`pip install flash-attn`), you can generate faster. (The Hugging Face documentation for flash-attention can be found [here](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2).)
+```python
+from transformers import DbrxForCausalLM, AutoTokenizer
+import torch
+
+tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-instruct", token="YOUR_HF_TOKEN")
+model = DbrxForCausalLM.from_pretrained(
+    "databricks/dbrx-instruct",
+    device_map="auto",
+    torch_dtype=torch.bfloat16,
+    token="YOUR_HF_TOKEN",
+    attn_implementation="flash_attention_2",
+    )
+
+input_text = "What does it take to build a great LLM?"
+messages = [{"role": "user", "content": input_text}]
+input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
+
+outputs = model.generate(**input_ids, max_new_tokens=200)
+print(tokenizer.decode(outputs[0]))
+```
+
+You can also generate faster using PyTorch's scaled dot product attention. (The Hugging Face documentation for scaled dot product attention can be found [here](https://huggingface.co/docs/transformers/perf_infer_gpu_one#pytorch-scaled-dot-product-attention).)
+```python
+from transformers import DbrxForCausalLM, AutoTokenizer
+import torch
+
+tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-instruct", token="YOUR_HF_TOKEN")
+model = DbrxForCausalLM.from_pretrained(
+    "databricks/dbrx-instruct",
+    device_map="auto",
+    torch_dtype=torch.bfloat16,
+    token="YOUR_HF_TOKEN",
+    attn_implementation="sdpa",
+    )
 
 input_text = "What does it take to build a great LLM?"
 messages = [{"role": "user", "content": input_text}]