multimodal: update doc and model path in launcher (#48)
* update multimodal doc and requirement

* update model path

---------

Co-authored-by: Xiaotong Chen <[email protected]>
x574chen and Xiaotong Chen authored Dec 20, 2024
1 parent 97108ec commit 0174d94
Showing 3 changed files with 15 additions and 9 deletions.
18 changes: 11 additions & 7 deletions docs/sphinx/vlm/vlm_offline_inference_en.rst
@@ -97,26 +97,30 @@ You can also use OpenAI's Python client library:
},
],
}],
-stream=False,
+stream=True,
max_completion_tokens=1024,
temperature=0.1,
)
+full_response = ""
+for chunk in response:
+    full_response += chunk.choices[0].delta.content
+    print(".", end="")
+print(f"\nFull Response: \n{full_response}")
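For reference, the chunk-aggregation loop that the updated doc demonstrates with ``stream=True`` can be sketched in isolation. This is a minimal sketch: ``collect_stream`` and ``make_chunk`` are hypothetical helpers, and the mock objects only mimic the shape of the OpenAI client's streamed chunks.

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Concatenate the delta payloads of streamed chat-completion chunks."""
    full_response = ""
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta is not None:  # a final chunk's delta content can be None
            full_response += delta
    return full_response

def make_chunk(text):
    # Minimal stand-in shaped like a streamed chat-completion chunk.
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])

chunks = [make_chunk("Hello"), make_chunk(", world"), make_chunk(None)]
print(collect_stream(chunks))  # Hello, world
```

Guarding against ``None`` deltas avoids a ``TypeError`` on the terminating chunk, which the doc's shorter loop does not handle.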
Launching with CLI
-------------------------
You can also opt to install dashinfer-vlm locally and use the command line to launch the server.

1. Pull dashinfer docker image (see :ref:`docker-label`)
2. Install the TensorRT Python package, and download the TensorRT GA build from the NVIDIA Developer Zone.

Example: TensorRT 10.6.0.26 for CUDA 12.6, Linux x86_64

.. code-block:: bash
pip install tensorrt
- wget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.6.0/tars/TensorRT-10.6.0.26.Linux.x86_64-gnu.cuda-12.6.tar.gz
- tar -xvzf TensorRT-10.6.0.26.Linux.x86_64-gnu.cuda-12.6.tar.gz
- export LD_LIBRARY_PATH=`pwd`/TensorRT-10.6.0.26/lib
+ wget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.5.0/tars/TensorRT-10.5.0.18.Linux.x86_64-gnu.cuda-12.6.tar.gz
+ tar -xvzf TensorRT-10.5.0.18.Linux.x86_64-gnu.cuda-12.6.tar.gz
+ export LD_LIBRARY_PATH=`pwd`/TensorRT-10.5.0.18/lib
3. Install dashinfer Python Package from `release <https://github.com/modelscope/dash-infer/releases>`_
4. Install dashinfer-vlm: ``pip install dashinfer-vlm``.
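After step 2, a quick way to confirm the ``export`` took effect is to check that the extracted lib directory is actually on ``LD_LIBRARY_PATH``. A hedged sketch: ``tensorrt_lib_on_path`` is a hypothetical helper, and it assumes the tarball was extracted to ``TensorRT-10.5.0.18/`` under the chosen base directory.

```python
import os

def tensorrt_lib_on_path(version="10.5.0.18", base_dir=None):
    """Return True if the extracted TensorRT lib directory is on LD_LIBRARY_PATH."""
    base_dir = base_dir or os.getcwd()
    lib_dir = os.path.join(base_dir, f"TensorRT-{version}", "lib")
    entries = os.environ.get("LD_LIBRARY_PATH", "").split(":")
    return lib_dir in entries

print(tensorrt_lib_on_path())
```

A ``False`` result usually means the ``export`` was run in a different shell, or the version in the path does not match the extracted directory.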
3 changes: 2 additions & 1 deletion multimodal/dashinfer_vlm/api_server/server.py
@@ -76,7 +76,8 @@ def init():
context.set("chat_format", chat_format)

# -----------------------Convert Model------------------------
-output_dir = "/root/.cache/as_model/" + model.split("/")[-1]
+home_dir = os.environ.get("HOME") or "/root"
+output_dir = os.path.join(home_dir, ".cache/as_model/", model.split("/")[-1])
model_name = "model"
data_type = "bfloat16"

3 changes: 2 additions & 1 deletion multimodal/requirements.txt
@@ -1,3 +1,4 @@
+tensorrt==10.5.0
av
numpy==1.24.3
requests==2.32.3
@@ -6,7 +7,7 @@ transformers>=4.45.0
cachetools>=5.4.0
six
tiktoken
-openai==1.52.2
+openai>=1.56.2
shortuuid
fastapi
pydantic_settings
