Update user docs for running llm server + upgrade gguf to 0.11.0 #676

Merged: 5 commits, Dec 12, 2024

82 changes: 22 additions & 60 deletions docs/shortfin/llm/user/e2e_llama8b_mi300x.md
@@ -22,32 +22,28 @@ python -m venv --prompt shark-ai .venv
source .venv/bin/activate
```

-### Install `shark-ai`
+## Install stable shark-ai packages

-You can install either the `latest stable` version of `shark-ai`
-or the `nightly` version:
-
-#### Stable
+<!-- TODO: Add `sharktank` to `shark-ai` meta package -->

```bash
-pip install shark-ai
+pip install shark-ai[apps] sharktank
```

-#### Nightly
-
-```bash
-pip install sharktank -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
-pip install shortfin -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
-```
+### Nightly packages

-#### Install dataclasses-json
+To install nightly packages:

-<!-- TODO: This should be included in release: -->
+<!-- TODO: Add `sharktank` to `shark-ai` meta package -->

```bash
-pip install dataclasses-json
+pip install shark-ai[apps] sharktank \
+--pre --find-links https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
```

+See also the
+[instructions here](https://github.com/nod-ai/shark-ai/blob/main/docs/nightly_releases.md).

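After either install command, it can help to confirm which package versions actually landed in the virtual environment. The following is a minimal sketch and not part of this PR: it assumes the distribution names `shark-ai`, `sharktank`, and `shortfin` that appear in the pip commands, uses only the standard-library `importlib.metadata`, and the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    # Distribution names are assumptions taken from the pip commands in the docs.
    for name, ver in installed_versions(["shark-ai", "sharktank", "shortfin"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Run it inside the activated `.venv`; a `not installed` entry suggests the corresponding pip command did not complete.
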
### Define a directory for export files

Create a new directory for us to export files like
@@ -78,8 +74,8 @@ This example uses the `llama8b_f16.gguf` and `tokenizer.json` files
that were downloaded in the previous step.

```bash
-export MODEL_PARAMS_PATH=$EXPORT_DIR/llama3.1-8b/llama8b_f16.gguf
-export TOKENIZER_PATH=$EXPORT_DIR/llama3.1-8b/tokenizer.json
+export MODEL_PARAMS_PATH=$EXPORT_DIR/meta-llama-3.1-8b-instruct.f16.gguf
+export TOKENIZER_PATH=$EXPORT_DIR/tokenizer.json
```

#### General env vars
@@ -91,8 +87,6 @@ The following env vars can be copy + pasted directly:
export MLIR_PATH=$EXPORT_DIR/model.mlir
# Path to export config.json file
export OUTPUT_CONFIG_PATH=$EXPORT_DIR/config.json
-# Path to export edited_config.json file
-export EDITED_CONFIG_PATH=$EXPORT_DIR/edited_config.json
# Path to export model.vmfb file
export VMFB_PATH=$EXPORT_DIR/model.vmfb
# Batch size for kvcache
@@ -108,7 +102,7 @@ to export our model to `.mlir` format.

```bash
python -m sharktank.examples.export_paged_llm_v1 \
---irpa-file=$MODEL_PARAMS_PATH \
+--gguf-file=$MODEL_PARAMS_PATH \
--output-mlir=$MLIR_PATH \
--output-config=$OUTPUT_CONFIG_PATH \
--bs=$BS
@@ -137,37 +131,6 @@ iree-compile $MLIR_PATH \
-o $VMFB_PATH
```

-## Write an edited config
-
-We need to write a config for our model with a slightly edited structure
-to run with shortfin. This will work for the example in our docs.
-You may need to modify some of the parameters for a specific model.
-
-### Write edited config
-
-```bash
-cat > $EDITED_CONFIG_PATH << EOF
-{
-"module_name": "module",
-"module_abi_version": 1,
-"max_seq_len": 131072,
-"attn_head_count": 8,
-"attn_head_dim": 128,
-"prefill_batch_sizes": [
-$BS
-],
-"decode_batch_sizes": [
-$BS
-],
-"transformer_block_count": 32,
-"paged_kv_cache": {
-"block_seq_stride": 16,
-"device_block_count": 256
-}
-}
-EOF
-```
-
## Running the `shortfin` LLM server

We should now have all of the files that we need to run the shortfin LLM server.
@@ -178,15 +141,14 @@ Verify that you have the following in your specified directory ($EXPORT_DIR):
ls $EXPORT_DIR
```

-- edited_config.json
- config.json
- meta-llama-3.1-8b-instruct.f16.gguf
- model.mlir
- model.vmfb
- tokenizer_config.json
- tokenizer.json
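
The directory check can also be done programmatically. This is a minimal sketch and not part of shark-ai: it assumes the post-change file list (without `edited_config.json`) and an `EXPORT_DIR` environment variable, and the helper name `missing_files` is hypothetical.

```python
import os
from pathlib import Path

# File names the updated docs expect to find in $EXPORT_DIR.
EXPECTED_FILES = [
    "config.json",
    "meta-llama-3.1-8b-instruct.f16.gguf",
    "model.mlir",
    "model.vmfb",
    "tokenizer_config.json",
    "tokenizer.json",
]


def missing_files(export_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in export_dir."""
    root = Path(export_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # Fall back to the current directory if EXPORT_DIR is not set.
    export_dir = os.environ.get("EXPORT_DIR", ".")
    missing = missing_files(export_dir)
    if missing:
        print(f"Missing from {export_dir}: {', '.join(missing)}")
    else:
        print(f"All expected files are present in {export_dir}")
```

An empty result means every file the server launch command refers to is in place.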

-### Launch server:
-
-<!-- #### Set the target device
-
-TODO: Add instructions on targeting different devices,
-when `--device=hip://$DEVICE` is supported -->
+### Launch server

#### Run the shortfin server

@@ -209,7 +171,7 @@ Run the following command to launch the Shortfin LLM Server in the background:
```bash
python -m shortfin_apps.llm.server \
--tokenizer_json=$TOKENIZER_PATH \
---model_config=$EDITED_CONFIG_PATH \
+--model_config=$OUTPUT_CONFIG_PATH \
--vmfb=$VMFB_PATH \
--parameters=$MODEL_PARAMS_PATH \
--device=hip > shortfin_llm_server.log 2>&1 &
@@ -252,7 +214,7 @@ port = 8000 # Change if running on a different port
generate_url = f"http://localhost:{port}/generate"

def generation_request():
-    payload = {"text": "What is the capital of the United States?", "sampling_params": {"max_completion_tokens": 50}}
+    payload = {"text": "Name the capital of the United States.", "sampling_params": {"max_completion_tokens": 50}}
    try:
        resp = requests.post(generate_url, json=payload)
        resp.raise_for_status() # Raises an HTTPError for bad responses
5 changes: 1 addition & 4 deletions sharktank/requirements.txt
@@ -1,12 +1,9 @@
iree-turbine

# Runtime deps.
-gguf==0.10.0
+gguf>=0.11.0
numpy<2.0

-# Needed for newer gguf versions (TODO: remove when gguf package includes this)
-# sentencepiece>=0.1.98,<=0.2.0
-
# Model deps.
huggingface-hub==0.22.2
transformers==4.40.0