
Add sglang example #92

Open · phatvo9 wants to merge 5 commits into main

Conversation

@phatvo9 (Contributor) commented Nov 14, 2024

No description provided.

@phatvo9 requested a review from luv-bansal, November 14, 2024 10:13
@luv-bansal (Contributor) left a comment

Mostly the same suggestions as on the lmdeploy PR.

Comment on lines 2 to 4
import subprocess
import sys
import threading

These dependencies aren't used directly in model.py and can be removed:

orjson
python-multipart

--extra-index-url https://flashinfer.ai/whl/cu121/torch2.4/

In prod and dev we have CUDA 12.4; I'm not sure if this cu121 index works with it, this needs to be verified.

@luv-bansal Nov 18, 2024

But I tested on q22, which also has CUDA 12.4, and prediction succeeded there, so I don't think this will be an issue.
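
In case the cu121 wheels do turn out to misbehave on CUDA 12.4, one option would be to point the extra index at a CUDA-12.4 build instead. A sketch, assuming flashinfer publishes a matching cu124/torch2.4 wheel index (worth verifying before relying on it):

```
--extra-index-url https://flashinfer.ai/whl/cu124/torch2.4/
flashinfer
```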

@luv-bansal Nov 18, 2024

I used the requirements below, with pinned dependency versions, to test locally and it worked. I think it's better to include the requirements with their versions here, because previously (I don't know why) I was getting an error when I didn't pin the dependency versions.

```
torch==2.4.0
tokenizers==0.20.2
transformers==4.46.2
accelerate==0.34.2
scipy==1.10.1
optimum==1.23.3
xformers==0.0.27.post2
einops==0.8.0
requests==2.32.2
packaging
ninja
protobuf==3.20.0

sglang[all]==0.3.5.post2
orjson==3.10.11
python-multipart==0.0.17

--extra-index-url https://flashinfer.ai/whl/cu121/torch2.4/
flashinfer
```

@luv-bansal (Contributor) commented Nov 18, 2024

@phatvo9 I uploaded the model on prod; the upload was successful but predictions are failing. Looking at the prod logs, I got the traceback below:

[2024-11-18 14:10:41 TP0] Traceback (most recent call last):
  File "/venv/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 171, in __init__
    self.capture()
  File "/venv/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 221, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/venv/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 243, in capture_one_batch_size
    self.model_runner.attn_backend.init_forward_metadata_capture_cuda_graph(
  File "/venv/lib/python3.10/site-packages/sglang/srt/layers/attention/flashinfer_backend.py", line 187, in init_forward_metadata_capture_cuda_graph
    self.indices_updater_decode.update(
  File "/venv/lib/python3.10/site-packages/sglang/srt/layers/attention/flashinfer_backend.py", line 352, in update_single_wrapper
    self.call_begin_forward(
  File "/venv/lib/python3.10/site-packages/sglang/srt/layers/attention/flashinfer_backend.py", line 441, in call_begin_forward
    create_flashinfer_kv_indices_triton[(bs,)](
  File "/venv/lib/python3.10/site-packages/triton/runtime/jit.py", line 345, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/venv/lib/python3.10/site-packages/triton/runtime/jit.py", line 607, in run
    device = driver.active.get_current_device()
  File "/venv/lib/python3.10/site-packages/triton/runtime/driver.py", line 23, in __getattr__
    self._initialize_obj()
  File "/venv/lib/python3.10/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
    self._obj = self._init_fn()
  File "/venv/lib/python3.10/site-packages/triton/runtime/driver.py", line 9, in _create_driver
    return actives[0]()
  File "/venv/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 371, in __init__
    self.utils = CudaUtils()  # TODO: make static
  File "/venv/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 80, in __init__
    mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
  File "/venv/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 57, in compile_module_from_src
    so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
  File "/venv/lib/python3.10/site-packages/triton/runtime/build.py", line 32, in _build
    raise RuntimeError("Failed to find C compiler. Please specify via CC environment variable.")
RuntimeError: Failed to find C compiler. Please specify via CC environment variable.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/venv/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1254, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/venv/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 169, in __init__
    self.tp_worker = TpWorkerClass(
  File "/venv/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 55, in __init__
    self.model_runner = ModelRunner(
  File "/venv/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 161, in __init__
    self.init_cuda_graphs()
  File "/venv/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 552, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
  File "/venv/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 173, in __init__
    raise Exception(
Exception: Capture cuda graph failed: Failed to find C compiler. Please specify via CC environment variable.
Possible solutions:
1. disable cuda graph by --disable-cuda-graph
2. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
3. disable torch compile by not using --enable-torch-compile
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose


W1118 14:10:41.581000 139900693243584 torch/_inductor/compile_worker/subproc_pool.py:126] SubprocPool unclean exit
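
The traceback shows triton failing to JIT-compile its CUDA driver shim because the image has no C compiler, and the failure happens during CUDA-graph capture. A minimal sketch of a possible workaround in model.py, assuming the sglang server is launched as a subprocess (the model path and port below are placeholders; `--disable-cuda-graph` is the flag the error message itself suggests):

```python
import os
import subprocess

# Copy the environment so we can optionally point triton at a compiler.
env = os.environ.copy()
# If the image does ship gcc, exporting CC lets triton JIT-compile instead:
# env["CC"] = "/usr/bin/gcc"

cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    "--port", "8000",                            # placeholder port
    # Skip CUDA-graph capture, the step that triggers the triton JIT
    # (and hence the missing-C-compiler error) in the trace above:
    "--disable-cuda-graph",
]
server = subprocess.Popen(cmd, env=env)
```

Disabling CUDA graphs trades some decode throughput for skipping the compile step entirely, which may be the safer default in images without a toolchain.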


inference_compute_info:
  cpu_limit: "4"
  cpu_memory: "24Gi"

Need to reduce cpu_memory, because at most 16Gi is available.
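
For example (a sketch; the exact value is an assumption and should be chosen below whatever the runtime actually grants):

```yaml
inference_compute_info:
  cpu_limit: "4"
  cpu_memory: "14Gi"  # assumed value; must stay under the 16Gi ceiling
```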
