LLM demos adjustments for Windows #2940

Draft: wants to merge 9 commits into main
74 changes: 50 additions & 24 deletions demos/continuous_batching/README.md
That makes it easy to use and efficient, especially on Intel® Xeon® processors.

> **Note:** This demo was tested on Intel® Xeon® processors Gen4 and Gen5 and Intel dGPU ARC and Flex models on Ubuntu 22/24 and RedHat 8/9.

## Get the docker image

Build the image from source to try the latest enhancements in this feature.
```bash
git clone https://github.com/openvinotoolkit/model_server.git
cd model_server
make release_image GPU=1
```
It will create an image called `openvino/model_server:latest`.
> **Note:** This operation might take 40 minutes or more depending on your build host.
> **Note:** The `GPU` parameter in the image build command is needed to include dependencies for the GPU device.
> **Note:** The public image from the last release might not be compatible with models exported using the latest export script. Check the [demo version from the last release](https://github.com/openvinotoolkit/model_server/tree/releases/2024/4/demos/continuous_batching) to use the public docker image.
## Prerequisites
- **For Linux users**: Docker Engine installed
- **For Windows users**: OVMS binary package installed according to the [baremetal deployment guide](../../docs/deploying_server_baremetal.md)

## Model preparation
> **Note:** Python 3.9 or higher is needed for this step.
That ensures faster initialization time, better performance and lower memory consumption.
LLM engine parameters will be defined inside the `graph.pbtxt` file.

Install Python dependencies for the conversion script:
```console
pip3 install -U -r demos/common/export_models/requirements.txt
```

Run the `export_model.py` script to download and quantize the model:
```console
mkdir models
python demos/common/export_models/export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --weight-format fp16 --kv_cache_precision u8 --config_file_path models/config.json --model_repository_path models
```
models
└── tokenizer.json
```

The default configuration of the `LLMExecutor` should work in most cases, but the parameters can be tuned via `export_model.py` script arguments or directly inside the `node_options` section of the `graph.pbtxt` file. Run the script with the `--help` argument to check the available parameters.
Note that the `models_path` parameter in the graph file can be an absolute path or relative to the `base_path` from `config.json`.
Check the [LLM calculator documentation](../../docs/llm/reference.md) to learn more about the configuration options.
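For reference, a minimal `config.json` tying the served model name to its directory might look roughly like the sketch below (the exact layout produced by the export script may differ; the name and path are the ones used in this demo):

```json
{
    "model_config_list": [],
    "mediapipe_config_list": [
        {
            "name": "meta-llama/Meta-Llama-3-8B-Instruct",
            "base_path": "meta-llama/Meta-Llama-3-8B-Instruct"
        }
    ]
}
```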


## Deploying with Docker

### CPU

### GPU

```bash
docker run -d --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/workspace:ro openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/config.json
```

## Deploying on Bare Metal

Assuming you have unpacked the model server package to your current working directory, run the `setupvars` script to set up the environment:

**Windows Command Line**
```bat
.\ovms\setupvars.bat
```

**Windows PowerShell**
```powershell
./ovms/setupvars.ps1
```

### CPU

The configuration created in the model preparation section loads models on CPU, so you can simply run the binary, pointing it to the configuration file and selecting a port for the HTTP server that exposes the inference endpoint.

```bat
ovms --rest_port 8000 --config_path ./models/config.json
```


### GPU

If you want to use a GPU device to run the generation, export the models with a precision matching the GPU capacity and adjust the pipeline configuration.
It can be done using the command below:
```console
python demos/common/export_models/export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --weight-format int4 --target_device GPU --cache_size 2 --config_file_path models/config.json --model_repository_path models --overwrite_models
```
Then rerun the serving command from above, as the configuration file has already been adjusted to deploy the model on GPU:

```bat
ovms --rest_port 8000 --config_path ./models/config.json
```

### Check readiness

Wait for the model to load. You can check the status with a simple command:
```console
curl http://localhost:8000/v1/config
```
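For a successfully loaded model, the response reports an `AVAILABLE` state, roughly along these lines (a sketch, with the model name used in this demo):

```json
{
  "meta-llama/Meta-Llama-3-8B-Instruct": {
    "model_version_status": [
      {
        "version": "1",
        "state": "AVAILABLE",
        "status": {
          "error_code": "OK",
          "error_message": "OK"
        }
      }
    ]
  }
}
```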

Chat endpoint is expected to be used for scenarios where conversation context should be passed by the client and the model prompt is created by the server based on the jinja template.
Completion endpoint should be used to pass the prompt directly by the client and for models without the jinja template.

### Unary:
```console
curl http://localhost:8000/v3/chat/completions \
-H "Content-Type: application/json" \
-d '{
```
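A complete request might look like the sketch below; the payload follows the OpenAI chat completions schema, and the message contents are illustrative assumptions:

```console
curl http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "max_tokens": 30,
    "stream": false,
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Say this is a test"}
    ]
  }'
```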

A similar call can be made with a `completion` endpoint:
```console
curl http://localhost:8000/v3/completions \
-H "Content-Type: application/json" \
-d '{
```
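A complete request is sketched below as well; the prompt text is an illustrative assumption:

```console
curl http://localhost:8000/v3/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "max_tokens": 30,
    "stream": false,
    "prompt": "This is a test"
  }'
```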
The `chat/completions` endpoints are compatible with the OpenAI client, so they can be easily used from client code, also in streaming mode:

Install the client library:
```console
pip3 install openai
```
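A minimal streaming sketch with the OpenAI Python client, assuming the server listens on `localhost:8000` and the model is served under the name used above:

```python
from openai import OpenAI

# Point the client at the local model server; the API key is required by the client but not checked by the server.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
# Print the generated tokens as they arrive.
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
```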
Output:
```
It looks like you're testing me!
```

Similar code can be applied to the completion endpoint:
```console
pip3 install openai
```
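A corresponding sketch for the completions endpoint, under the same assumptions:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

# The completions endpoint takes a raw prompt instead of a message list.
stream = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="Say this is a test",
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].text is not None:
        print(chunk.choices[0].text, end="", flush=True)
```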

OpenVINO Model Server employs efficient parallelization for text generation. It can generate text at high concurrency in an environment shared by multiple clients.
This can be demonstrated using the benchmarking app from the vLLM repository:
```console
git clone --branch v0.6.0 --depth 1 https://github.com/vllm-project/vllm
cd vllm
pip3 install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
```
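The benchmark client can then be pointed at the model server's OpenAI-compatible endpoint. A rough sketch of such a run is shown below; the dataset, request count and flag values are illustrative assumptions:

```console
cd benchmarks
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python benchmark_serving.py \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v3/chat/completions \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 100
```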