Streamline LLM import/compile/serve user experience #691
I just went through the current docs at https://github.com/nod-ai/shark-ai/blob/main/docs/shortfin/llm/user/e2e_llama8b_mi300x.md. Good news: everything worked on this machine, using Python 3.11 on Ubuntu 22.04 and a w7900. A few more notes:
Progress on #691, trying to simplify a few steps before putting this into release notes for 3.1.0:

* Add suggested `export/` directory to `.gitignore` (I'd prefer for the tools to default to a path in the user's homedir, but this is a less invasive change)
* Remove `sharktank` from install instructions, as it is included in `shark-ai` nightly releases now
* Rework the "Verify server" section to start with a health check, then use `curl` (a rough sketch of that flow follows below). Keep the Python sample code for now, though similar projects usually also have a Python API for interfacing with LLMs. We also don't use a standardized HTTP API yet (like the OpenAI API). Maybe the SGLang integration will be more natural for users.
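As one illustration of that reworked flow, here is a minimal Python sketch of a health check followed by a generation request. The base URL, the `/health` and `/generate` routes, and the JSON payload shape are assumptions taken from the docs under discussion, not a standardized API:

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed default shortfin server address

# Health check first: fail fast if the server is not up.
health = requests.get(f"{BASE_URL}/health", timeout=10)
health.raise_for_status()

# Then a generation request. The /generate route and payload shape are
# assumptions based on the current docs, not a standardized HTTP API.
response = requests.post(
    f"{BASE_URL}/generate",
    json={
        "text": "Name the capital of the United States.",
        "sampling_params": {"max_completion_tokens": 50},
    },
    timeout=60,
)
print(response.text)
```

An OpenAI-compatible route would let existing client libraries replace this hand-rolled request, which is part of the motivation for the API discussion below.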
Progress on #691:

* Generalize the guide to any llama model on any accelerator, mentioning specifics of platform/model support where it matters
* Restructure the introduction section with more context and an overview of the rest of the guide (more still to do here, explaining this tech stack)
* Add a prerequisites section, modeled after the [user guide](https://github.com/nod-ai/shark-ai/blob/main/docs/user_guide.md)
* Add more "why" explanations for many steps (more still to do here)
* Start trimming environment variable instructions
Quoting https://sgl-project.github.io/frontend/frontend.html:
We should be able to support both / multiple APIs, but focusing on SGLang for now makes sense.
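For reference, the SGLang frontend that page describes looks roughly like the following; a sketch assuming a shortfin-compatible `RuntimeEndpoint` at a hypothetical local address:

```python
import sglang as sgl

@sgl.function
def answer_question(s, question):
    # Build a chat-style prompt and generate a bounded completion.
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

# Hypothetical address; point this at whatever backend serves the model.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

state = answer_question.run(question="Name the capital of the United States.")
print(state["answer"])
```

This is the kind of user-facing interface the shortfin server currently lacks, which is why the SGLang integration may feel more natural than raw HTTP requests.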
Here is our current documentation for running llama models through shark-ai: https://github.com/nod-ai/shark-ai/blob/main/docs/shortfin/llm/user/e2e_llama8b_mi300x.md (permalink).
Our current steps are convoluted compared to similar documentation in other projects.
Comparison with other projects
* vLLM
* TensorRT-LLM
* MLC LLM
* Ollama
* TorchServe
Feedback
I have some inlined comments on one file in this commit on my fork. Here I'll try to summarize them as tasks:
* The docs hardcode `--iree-hip-target=gfx942`. The file should be generalized.
* The `sharktank.utils.hf_datasets` module is a tool for developer convenience and testing. Users should be directed to download from huggingface directly, using standard tools and APIs. For most cases this means either the huggingface-cli tool (https://huggingface.co/docs/huggingface_hub/main/en/guides/cli) or the huggingface_hub library (https://huggingface.co/docs/hub/en/models-downloading). A sketch of the library path follows this list.
* Use model names as they appear on huggingface rather than the shorthand names defined in the `hf_datasets` file, e.g. `SanctumAI/Meta-Llama-3.1-8B-Instruct-GGUF` and not `llama3_8B_fp16`.
* Simplify the export step via `sharktank.examples.export_paged_llm_v1`.
* Use `iree.build` for some or all steps: Find More General and Easier to use Alternative For Compiling Models for Shortfin LLM Server #402
* Verify the server using `curl`.
* Replace `kill -9 $shortfin_process`: Improve Method for Shutting Down Shortfin LLM Server #490
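To make the recommended download path concrete, here is a minimal sketch using the `huggingface_hub` library in place of `sharktank.utils.hf_datasets`. The GGUF filename is a hypothetical example; check the repository's file listing for the actual artifact name:

```python
from huggingface_hub import hf_hub_download

# Download a single GGUF file from the model repository on huggingface.
# The filename below is a hypothetical example, not a verified artifact name.
gguf_path = hf_hub_download(
    repo_id="SanctumAI/Meta-Llama-3.1-8B-Instruct-GGUF",
    filename="meta-llama-3.1-8b-instruct.f16.gguf",
)
print(f"Downloaded to: {gguf_path}")
```

Users who prefer the command line can do the equivalent with `huggingface-cli download`; either way, models land in the standard huggingface cache rather than a project-specific location.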