Streamline LLM import/compile/serve user experience #691

Open · 2 of 8 tasks
ScottTodd opened this issue Dec 12, 2024 · 2 comments
Labels: documentation (Improvements or additions to documentation), enhancement (New feature or request)

ScottTodd commented Dec 12, 2024

Here is our current documentation for running llama models through shark-ai: https://github.com/nod-ai/shark-ai/blob/main/docs/shortfin/llm/user/e2e_llama8b_mi300x.md (permalink).

Our current steps are convoluted compared to similar documentation in other projects.

Comparison with other projects:

  • vLLM
  • TensorRT-LLM
  • MLC LLM
  • Ollama
  • TorchServe

Feedback

I have some inline comments on one file in this commit on my fork. Here I'll try to summarize them as tasks:

  • Only one line in that guide is actually device-specific: --iree-hip-target=gfx942. The file should be generalized.
  • Any actions that create files (downloading, importing, compiling, etc.) should choose a consistent and structured default location. Similarly, tool arguments should have sensible default values, so the documentation does not need to specify 9+ environment variables.
  • The sharktank.utils.hf_datasets module is a tool for developer convenience and testing. Users should be directed to download from Hugging Face directly, using standard tools and APIs. For most cases this means either the huggingface-cli tool (https://huggingface.co/docs/huggingface_hub/main/en/guides/cli) or the huggingface_hub library (https://huggingface.co/docs/hub/en/models-downloading); a rough sketch follows after this list.
    • We should refer to models with their standard "organization/repository" naming, not the shorthand from our hf_datasets file, e.g. SanctumAI/Meta-Llama-3.1-8B-Instruct-GGUF and not llama3_8B_fp16.
    • That script does group files from multiple repos (typically the quantized .gguf from one repo and the tokenizer .json from another). We should be more flexible there... maybe include a standard/default tokenizer config ourselves?
    • Whatever tooling we have should work across multiple LLM models (llama 2, llama 3, mistral, mixtral, gemma, etc.). If models don't work then the tools should produce good error messages. If models require special export code then that needs some project architecture work. We shouldn't be directing users to a script called sharktank.examples.export_paged_llm_v1.
  • We could group some subsets of the [download, export to mlir, compile, serve] steps into a new tool, similar to how the projects linked above handle it. This could leverage iree.build for some or all steps: Find More General and Easier to use Alternative For Compiling Models for Shortfin LLM Server #402
    • Such a script could default to the current GPU device(s) and include options for multi-device sharding and other high level configurations that users/developers may want to use
  • The golden path through the documentation should be free of text like "if you see this error, run this to fix it" or "make sure you have these files here before running this command". The error messages should explain how to fix them and the defaults should work for most users.
  • Our current server seems to have its own API? We should use industry standard APIs like OpenAI's API: [tracking] Production Grade Shortfin-LLM #245
  • Demo commands do not need to use a python interpreter to send HTTP requests. Other projects provide a client library or use curl.
  • Server shutdown needs a better story than kill -9 $shortfin_process: Improve Method for Shutting Down Shortfin LLM Server #490
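For reference, here is a minimal sketch of the "use standard Hugging Face tooling" direction from the hf_datasets task above, using the huggingface_hub library. The repo name comes from the task list; the exact .gguf and tokenizer filenames are assumptions for illustration, not verified contents of that repository.

```python
# Sketch only: download a quantized model and tokenizer with huggingface_hub
# instead of the sharktank.utils.hf_datasets helper.
from huggingface_hub import hf_hub_download

# Repo id uses the standard "organization/repository" naming.
gguf_path = hf_hub_download(
    repo_id="SanctumAI/Meta-Llama-3.1-8B-Instruct-GGUF",
    filename="meta-llama-3.1-8b-instruct.f16.gguf",  # assumed filename
)
tokenizer_path = hf_hub_download(
    repo_id="SanctumAI/Meta-Llama-3.1-8B-Instruct-GGUF",
    filename="tokenizer.json",  # assumed; may need to come from a separate repo
)
print(gguf_path)
print(tokenizer_path)
```

Files land in the standard Hugging Face cache (~/.cache/huggingface by default), which also addresses the "consistent default location" task above.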
ScottTodd added the documentation and enhancement labels on Dec 12, 2024

ScottTodd commented Dec 19, 2024

I just went through the current docs at https://github.com/nod-ai/shark-ai/blob/main/docs/shortfin/llm/user/e2e_llama8b_mi300x.md. Good news: everything worked on this machine, using Python 3.11 on Ubuntu 22.04 and a W7900 (with --iree-hip-target=gfx1100).

A few more notes:

  • Export failed on Python 3.10 with a SyntaxError on the `[*key]` subscript unpacking (that syntax requires Python 3.11):

      File "/home/nod/dev/projects/shark-ai/3.11.llama.venv/lib/python3.10/site-packages/sharktank/types/tensors.py", line 499
        self.as_torch()[*key] = unbox_tensor(value)
                        ^
    

    Docs already suggest 3.11 as the minimum, but we can do better.

  • The environment variables, if we keep them, should put more context in the file names:

    -    export MLIR_PATH=$EXPORT_DIR/model.mlir
    -    export VMFB_PATH=$EXPORT_DIR/model.vmfb
    +    export MLIR_PATH=$EXPORT_DIR/llama_8b_fp16_bs1_bs4.mlir
    +    export VMFB_PATH=$EXPORT_DIR/llama_8b_fp16_bs1_bs4_rocm_gfx1100.vmfb

    Building tools to group the download/import/compile steps would help keep the names and metadata organized.

  • The sharktank.examples.export_paged_llm_v1 script (I'd like to at least rename it...) has no progress indicators and takes around 1 minute for the provided 8b example on this machine. We could at least log some expectation about how long it will take, so users know how long to give it before giving up.

  • As mentioned in some other points, running the server with python -m shortfin_apps.llm.server, grabbing the process ID, and redirecting output to a file is awkward. We can add a console script entry point like shortfin_server (see the sketch after this list), build logging redirection into the script itself, and add a better shutdown method.

  • The output for
    payload = {"text": "Name the capital of the United States.", "sampling_params": {"max_completion_tokens": 50}}
    is
    data: Washington, D.C.! Washington, D.C.! Washington, D.C.! Washington, D.C.! Washington, D.C.! Washington, D.C.! Washington, D.C.! Washington, D.C.! Washington, D.C.! Washington, D.C.! Washington
    on my machine. Should it be repeated that much? Maybe we should pick a different example prompt.

    edit: tried again and got a different response in 1/4 runs:
    Washington, D.C.! Washington, D.C.! Washington, D.C.! Washington, D.C.! Washington, Delsingelsingelsingelsingelsingelsingelsingeroneroneroneron이에 andija centrif blush blush blush blush and and and and and and and and and
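
As a rough illustration of the console script idea in the notes above, a setuptools entry point could map a shortfin_server command onto the existing server module. This is a sketch only; the `main` function name and the packaging metadata are assumptions, not the project's actual configuration.

```python
# Hypothetical packaging sketch: expose `shortfin_server` as a console script.
from setuptools import setup

setup(
    name="shortfin",
    entry_points={
        "console_scripts": [
            # After install, `shortfin_server` would behave like
            # `python -m shortfin_apps.llm.server`, assuming the module
            # exposes a `main()` entry point.
            "shortfin_server = shortfin_apps.llm.server:main",
        ],
    },
)
```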

ScottTodd added a commit that referenced this issue Dec 19, 2024
Progress on #691, trying to
simplify a few steps before putting this into release notes for 3.1.0.

* Add suggested `export/` directory to `.gitignore` (I'd prefer for the
tools to default to a path in the user's homedir, but this is a less
invasive change)
* Remove `sharktank` from install instructions as it is included in
`shark-ai` nightly releases now
* Rework "Verify server" section to start with a health check then use
`curl`. Keep the Python sample code for now, though similar projects
usually also have a Python API for interfacing with LLMs. We also don't
use a standardized HTTP API yet (like the OpenAI API). Maybe the SGLang
integration will be more natural for users.
ScottTodd added a commit that referenced this issue Dec 19, 2024
Progress on #691.

* Generalize guide to any llama model on any accelerator, mentioning
specifics of platform/model support where it matters
* Restructure the introduction section with more context and an overview
of the rest of the guide (more still to do here, explaining this tech
stack)
* Add prerequisites section, modeled after the [user
guide](https://github.com/nod-ai/shark-ai/blob/main/docs/user_guide.md)
* Add more "why" explanations for many steps (more still to do here)
* Start trimming environment variable instructions
ScottTodd commented:

> [ ] Our current server seems to have its own API? We should use industry standard APIs like OpenAI's API: [tracking] Production Grade Shortfin-LLM #245

Quoting https://sgl-project.github.io/frontend/frontend.html:

> The frontend language can be used with local models or API models. It is an alternative to the OpenAI API. You may find it easier to use for complex prompting workflow.

We should be able to support both / multiple APIs, but focusing on SGLang for now makes sense.
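
For context, the SGLang frontend language referenced above looks roughly like the sketch below (adapted from the SGLang documentation). The endpoint URL, prompt, and generation parameters are placeholders, and this is not something our server implements today.

```python
import sglang as sgl

# Placeholder endpoint for a running SGLang-compatible runtime.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def capital_question(s, country):
    # Build a chat-style prompt and ask the backend to generate an answer.
    s += sgl.user(f"Name the capital of {country}.")
    s += sgl.assistant(sgl.gen("answer", max_tokens=50))

state = capital_question.run(country="the United States")
print(state["answer"])
```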
