Commit 90f256c

Update perf_infer_gpu_one.md: fix a typo (#35441)
martin0258 authored Dec 29, 2024
1 parent 5c75087 commit 90f256c
Showing 1 changed file with 1 addition and 1 deletion.
docs/source/en/perf_infer_gpu_one.md (1 addition, 1 deletion)
@@ -462,7 +462,7 @@ generated_ids = model.generate(**inputs)
 outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
 ```
 
-To load a model in 4-bit for inference with multiple GPUs, you can control how much GPU RAM you want to allocate to each GPU. For example, to distribute 1GB of memory to the first GPU and 2GB of memory to the second GPU:
+To load a model in 8-bit for inference with multiple GPUs, you can control how much GPU RAM you want to allocate to each GPU. For example, to distribute 1GB of memory to the first GPU and 2GB of memory to the second GPU:
 
 ```py
 max_memory_mapping = {0: "1GB", 1: "2GB"}
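For readers without the full file open: the paragraph this hunk corrects introduces an 8-bit multi-GPU loading example, and the hunk shows only its first line. A minimal sketch of how that example plausibly continues is below; the checkpoint name and the exact call shape are assumptions on my part, not part of the diff.

```py
# Sketch: load a model in 8-bit across two GPUs, capping GPU 0 at 1GB and
# GPU 1 at 2GB via max_memory. The checkpoint name is a placeholder.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

max_memory_mapping = {0: "1GB", 1: "2GB"}
model_name = "bigscience/bloom-1b7"  # hypothetical example checkpoint

model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights via bitsandbytes
    device_map="auto",  # let Accelerate shard the model across the visible GPUs
    max_memory=max_memory_mapping,  # per-GPU memory budget
)
```

Running a snippet like this requires the `bitsandbytes` and `accelerate` packages in addition to `transformers`.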
