LLM demos adjustments for Windows #2940

Draft: wants to merge 9 commits into main
74 changes: 50 additions & 24 deletions demos/continuous_batching/README.md
That makes it easy to use and efficient, especially on Intel® Xeon® processors.

> **Note:** This demo was tested on Intel® Xeon® processors Gen4 and Gen5 and Intel dGPU ARC and Flex models on Ubuntu 22/24 and RedHat 8/9.

## Get the docker image

Build the image from source to try the latest enhancements in this feature.
```bash
git clone https://github.com/openvinotoolkit/model_server.git
cd model_server
make release_image GPU=1
```
It will create an image called `openvino/model_server:latest`.
> **Note:** This operation might take 40 minutes or more depending on your build host.
> **Note:** The `GPU` parameter in the image build command is needed to include dependencies for the GPU device.
> **Note:** The public image from the last release might not be compatible with models exported using the latest export script. Check the [demo version from the last release](https://github.com/openvinotoolkit/model_server/tree/releases/2024/4/demos/continuous_batching) to use the public docker image.
## Prerequisites
- **For Linux users**: Docker Engine installed
- **For Windows users**: OVMS binary package installed according to the [baremetal deployment guide](../../docs/deploying_server_baremetal.md)

## Model preparation
> **Note:** Python 3.9 or higher is needed for this step.
That ensures faster initialization time, better performance and lower memory consumption.
LLM engine parameters will be defined inside the `graph.pbtxt` file.

Install Python dependencies for the conversion script:
```console
pip3 install -U -r demos/common/export_models/requirements.txt
```

Run the `export_model.py` script to download and quantize the model:
```console
mkdir models
python demos/common/export_models/export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --weight-format fp16 --kv_cache_precision u8 --config_file_path models/config.json --model_repository_path models
```
models
└── tokenizer.json
```

The default configuration of the `LLMExecutor` should work in most cases, but the parameters can be tuned via `export_model.py` script arguments or directly inside the `node_options` section of the `graph.pbtxt` file. Run the script with the `--help` argument to check the available parameters.
Note that the `models_path` parameter in the graph file can be an absolute path or relative to the `base_path` from `config.json`.
Check the [LLM calculator documentation](../../docs/llm/reference.md) to learn more about the configuration options.
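For reference, a minimal `config.json` tying the served model name to its directory might look roughly like the sketch below (the exact layout produced by the export script may differ; the name and path are the ones used in this demo):

```json
{
    "model_config_list": [],
    "mediapipe_config_list": [
        {
            "name": "meta-llama/Meta-Llama-3-8B-Instruct",
            "base_path": "meta-llama/Meta-Llama-3-8B-Instruct"
        }
    ]
}
```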


## Deploying with Docker

### CPU

### GPU

```bash
docker run -d --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/workspace:ro openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/config.json
```

## Deploying on Bare Metal

Assuming you have unpacked the model server package to your current working directory, run the `setupvars` script to set up the environment:

**Windows Command Line**
```bat
.\ovms\setupvars.bat
```

**Windows PowerShell**
```powershell
./ovms/setupvars.ps1
```

### CPU

The configuration created in the model preparation section loads models on CPU, so you can simply run the binary, pointing it to the configuration file and selecting a port for the HTTP server that exposes the inference endpoint.

```bat
ovms --rest_port 8000 --config_path ./models/config.json
```


### GPU

If you want to use a GPU device to run the generation, export the models with a precision matching the GPU capacity and adjust the pipeline configuration.
It can be done using the command below:
```console
python demos/common/export_models/export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --weight-format int4 --target_device GPU --cache_size 2 --config_file_path models/config.json --model_repository_path models --overwrite_models
```
Then rerun the serving command from above, as the configuration file has already been adjusted to deploy the model on GPU:

```bat
ovms --rest_port 8000 --config_path ./models/config.json
```

### Check readiness

Wait for the model to load. You can check the status with a simple command:
```console
curl http://localhost:8000/v1/config
```
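For a successfully loaded model, the response reports an `AVAILABLE` state, roughly along these lines (a sketch, with the model name used in this demo):

```json
{
  "meta-llama/Meta-Llama-3-8B-Instruct": {
    "model_version_status": [
      {
        "version": "1",
        "state": "AVAILABLE",
        "status": {
          "error_code": "OK",
          "error_message": "OK"
        }
      }
    ]
  }
}
```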

Chat endpoint is expected to be used for scenarios where conversation context should be passed by the client and the model prompt is created by the server based on the jinja template.
Completion endpoint should be used to pass the prompt directly by the client and for models without the jinja template.

### Unary:
```console
curl http://localhost:8000/v3/chat/completions \
-H "Content-Type: application/json" \
-d '{
```
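A complete request might look like the sketch below; the payload follows the OpenAI chat completions schema, and the message contents are illustrative assumptions:

```console
curl http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "max_tokens": 30,
    "stream": false,
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Say this is a test"}
    ]
  }'
```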

A similar call can be made with a `completion` endpoint:
```console
curl http://localhost:8000/v3/completions \
-H "Content-Type: application/json" \
-d '{
```
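A complete request is sketched below as well; the prompt text is an illustrative assumption:

```console
curl http://localhost:8000/v3/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "max_tokens": 30,
    "stream": false,
    "prompt": "This is a test"
  }'
```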
The `chat/completions` endpoints are compatible with the OpenAI client, so they can be easily used from client code, also in streaming mode:

Install the client library:
```console
pip3 install openai
```
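A minimal streaming sketch with the OpenAI Python client, assuming the server listens on `localhost:8000` and the model is served under the name used above:

```python
from openai import OpenAI

# Point the client at the local model server; the API key is required by the client but not checked by the server.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
# Print the generated tokens as they arrive.
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
```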
Output:
```
It looks like you're testing me!
```

Similar code can be applied to the completion endpoint:
```console
pip3 install openai
```
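A corresponding sketch for the completions endpoint, under the same assumptions:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

# The completions endpoint takes a raw prompt instead of a message list.
stream = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="Say this is a test",
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].text is not None:
        print(chunk.choices[0].text, end="", flush=True)
```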

OpenVINO Model Server employs efficient parallelization for text generation. It can generate text at high concurrency in an environment shared by multiple clients.
This can be demonstrated using the benchmarking app from the vLLM repository:
```console
git clone --branch v0.6.0 --depth 1 https://github.com/vllm-project/vllm
cd vllm
pip3 install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
```
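The benchmark client can then be pointed at the model server's OpenAI-compatible endpoint. A rough sketch of such a run is shown below; the dataset, request count and flag values are illustrative assumptions:

```console
cd benchmarks
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python benchmark_serving.py \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v3/chat/completions \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 100
```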