Merge 1126 (#7)
* Remove hardcode flash-attn disable setting (lm-sys#2342)

* Document turning off proxy_buffering when api is streaming (lm-sys#2337)

* Simplify huggingface api example (lm-sys#2355)

* Update sponsor logos (lm-sys#2367)

* if LOGDIR is empty, then don't try output log to local file (lm-sys#2357)

Signed-off-by: Lei Wen <[email protected]>
Co-authored-by: Lei Wen <[email protected]>

* add best_of and use_beam_search for completions interface (lm-sys#2348)

Signed-off-by: Lei Wen <[email protected]>
Co-authored-by: Lei Wen <[email protected]>

* Extract upvote/downvote from log files (lm-sys#2369)

* Revert "add best_of and use_beam_search for completions interface" (lm-sys#2370)

* Improve doc (lm-sys#2371)

* add best_of and use_beam_search for completions interface (lm-sys#2372)

Signed-off-by: Lei Wen <[email protected]>
Co-authored-by: Lei Wen <[email protected]>

* update monkey patch for llama2 (lm-sys#2379)

* Make E5 adapter more restrict to reduce mismatch (lm-sys#2381)

* Update UI and sponsers (lm-sys#2387)

* Use fsdp api for save save (lm-sys#2390)

* Release v0.2.27

* Spicyboros + airoboros 2.2 template update. (lm-sys#2392)

Co-authored-by: Jon Durbin <[email protected]>

* bugfix of openai_api_server for fastchat.serve.vllm_worker (lm-sys#2398)

Co-authored-by: wuyongyu <[email protected]>

* Revert "bugfix of openai_api_server for fastchat.serve.vllm_worker" (lm-sys#2400)

* Revert "add best_of and use_beam_search for completions interface" (lm-sys#2401)

* Release a v0.2.28 with bug fixes and more test cases

* Fix model_worker error (lm-sys#2404)

* Added google/flan models and fixed AutoModelForSeq2SeqLM when loading T5 compression model (lm-sys#2402)

* Rename twitter to X (lm-sys#2406)

* Update huggingface_api.py (lm-sys#2409)

* Add support for baichuan2 models (lm-sys#2408)

* Fixed character overlap issue when api streaming output (lm-sys#2431)

* Support custom conversation template in multi_model_worker (lm-sys#2434)

* Add Ascend NPU support (lm-sys#2422)

* Add raw conversation template (lm-sys#2417) (lm-sys#2418)

* Improve docs & UI (lm-sys#2436)

* Fix Salesforce xgen inference (lm-sys#2350)

* Add support for Phind-CodeLlama models (lm-sys#2415) (lm-sys#2416)

Co-authored-by: Lianmin Zheng <[email protected]>

* Add falcon 180B chat conversation template (lm-sys#2384)

* Improve docs (lm-sys#2438)

* add dtype and seed (lm-sys#2430)

* Data cleaning scripts for dataset release (lm-sys#2440)

* merge google/flan based adapters: T5Adapter, CodeT5pAdapter, FlanAdapter (lm-sys#2411)

* Fix docs

* Update UI (lm-sys#2446)

* Add Optional SSL Support to controller.py (lm-sys#2448)

* Format & Improve docs

* Release v0.2.29 (lm-sys#2450)

* Show terms of use as an JS alert (lm-sys#2461)

* vllm worker awq quantization update (lm-sys#2463)

Co-authored-by: 董晓龙 <[email protected]>

* Fix falcon chat template (lm-sys#2464)

* Fix chunk handling when partial chunks are returned (lm-sys#2485)

* Update openai_api_server.py to add an SSL option (lm-sys#2484)

* Update vllm_worker.py (lm-sys#2482)

* fix typo quantization (lm-sys#2469)

* fix vllm quanziation args

* Update README.md (lm-sys#2492)

* Huggingface api worker (lm-sys#2456)

* Update links to lmsys-chat-1m (lm-sys#2497)

* Update train code to support the new tokenizer (lm-sys#2498)

* Third Party UI Example (lm-sys#2499)

* Add metharme (pygmalion) conversation template (lm-sys#2500)

* Optimize for proper flash attn causal handling (lm-sys#2503)

* Add Mistral AI instruction template (lm-sys#2483)

* Update monitor & plots (lm-sys#2506)

* Release v0.2.30 (lm-sys#2507)

* Fix for single turn dataset (lm-sys#2509)

* replace os.getenv with os.path.expanduser because the first one doesn… (lm-sys#2515)

Co-authored-by: khalil <[email protected]>

* Fix arena (lm-sys#2522)

* Update Dockerfile (lm-sys#2524)

* add Llama2ChangAdapter (lm-sys#2510)

* Add ExllamaV2 Inference Framework Support. (lm-sys#2455)

* Improve docs (lm-sys#2534)

* Fix warnings for new gradio versions (lm-sys#2538)

* revert the gradio change; now works for 3.40

* Improve chat templates (lm-sys#2539)

* Add Zephyr 7B Alpha (lm-sys#2535)

* Improve Support for Mistral-Instruct (lm-sys#2547)

* correct max_tokens by context_length instead of raise exception (lm-sys#2544)

* Revert "Improve Support for Mistral-Instruct" (lm-sys#2552)

* Fix Mistral template (lm-sys#2529)

* Add additional Informations from the vllm worker (lm-sys#2550)

* Make FastChat work with LMSYS-Chat-1M Code (lm-sys#2551)

* Create `tags` attribute to fix `MarkupError` in rich CLI (lm-sys#2553)

* move BaseModelWorker outside serve.model_worker to make it independent (lm-sys#2531)

* Misc style and bug fixes (lm-sys#2559)

* Fix README.md (lm-sys#2561)

* release v0.2.31 (lm-sys#2563)

* resolves lm-sys#2542 modify dockerfile to upgrade cuda to 12.2.0 and pydantic 1.10.13 (lm-sys#2565)

* Add airoboros_v3 chat template (llama-2 format) (lm-sys#2564)

* Add Xwin-LM V0.1, V0.2 support (lm-sys#2566)

* Fixed model_worker generate_gate may blocked main thread (lm-sys#2540) (lm-sys#2562)

* feat: add claude-v2 (lm-sys#2571)

* Update vigogne template (lm-sys#2580)

* Fix issue lm-sys#2568: --device mps led to TypeError: forward() got an unexpected keyword argument 'padding_mask'. (lm-sys#2579)

* Add Mistral-7B-OpenOrca conversation_temmplate (lm-sys#2585)

* docs: bit misspell comments model adapter default template name conversation (lm-sys#2594)

* Update Mistral template (lm-sys#2581)

* Fix <s> in mistral template

* Update README.md  (vicuna-v1.3 -> vicuna-1.5) (lm-sys#2592)

* Update README.md to highlight chatbot arena (lm-sys#2596)

* Add Lemur model (lm-sys#2584)

Co-authored-by: Roberto Ugolotti <[email protected]>

* add trust_remote_code=True in BaseModelAdapter (lm-sys#2583)

* Openai interface add use beam search and best of 2 (lm-sys#2442)

Signed-off-by: Lei Wen <[email protected]>
Co-authored-by: Lei Wen <[email protected]>

* Update qwen and add pygmalion (lm-sys#2607)

* feat: Support model AquilaChat2 (lm-sys#2616)

* Added settings vllm (lm-sys#2599)

Co-authored-by: bodza <[email protected]>
Co-authored-by: bodza <[email protected]>

* [Logprobs] Support logprobs=1 (lm-sys#2612)

* release v0.2.32

* fix: Fix for OpenOrcaAdapter to return correct conversation template (lm-sys#2613)

* Make fastchat.serve.model_worker to take debug argument (lm-sys#2628)

Co-authored-by: hi-jin <[email protected]>

* openchat 3.5 model support (lm-sys#2638)

* xFastTransformer framework support (lm-sys#2615)

* feat: support custom models vllm serving (lm-sys#2635)

* kill only fastchat process (lm-sys#2641)

* Update server_arch.png

* Use conv.update_last_message api in mt-bench answer generation (lm-sys#2647)

* Improve Azure OpenAI interface (lm-sys#2651)

* Add required_temp support in jsonl format to support flexible temperature setting for gen_api_answer (lm-sys#2653)

* Pin openai version < 1 (lm-sys#2658)

* Remove exclude_unset parameter (lm-sys#2654)

* Revert "Remove exclude_unset parameter" (lm-sys#2666)

* added support for CodeGeex(2) (lm-sys#2645)

* add chatglm3 conv template support in conversation.py (lm-sys#2622)

* UI and model change (lm-sys#2672)

Co-authored-by: Lianmin Zheng <[email protected]>

* train_flant5: fix typo (lm-sys#2673)

* Fix gpt template (lm-sys#2674)

* Update README.md (lm-sys#2679)

* feat: support template's stop_str as list (lm-sys#2678)

* Update exllama_v2.md (lm-sys#2680)

* save model under deepspeed (lm-sys#2689)

* Adding SSL support for model workers and huggingface worker (lm-sys#2687)

* Check the max_new_tokens <= 0 in openai api server (lm-sys#2688)

* Add Microsoft/Orca-2-7b and update model support docs (lm-sys#2714)

* fix tokenizer of chatglm2 (lm-sys#2711)

* Template for using Deepseek code models (lm-sys#2705)

* add support for Chinese-LLaMA-Alpaca (lm-sys#2700)

* Make --load-8bit flag work with weights in safetensors format (lm-sys#2698)

* Format code and minor bug fix (lm-sys#2716)

* Bump version to v0.2.33 (lm-sys#2717)

* fix tokenizer.pad_token attribute error (lm-sys#2710)

* support stable-vicuna model (lm-sys#2696)

* Exllama cache 8bit (lm-sys#2719)

* Add Yi support (lm-sys#2723)

* Add Hermes 2.5 [fixed] (lm-sys#2725)

* Fix Hermes2Adapter (lm-sys#2727)

* Fix YiAdapter (lm-sys#2730)

* add trust_remote_code argument (lm-sys#2715)

* Add revision arg to MT Bench answer generation (lm-sys#2728)

* Fix MPS backend 'index out of range' error (lm-sys#2737)

* add starling support (lm-sys#2738)

---------

Signed-off-by: Lei Wen <[email protected]>
Co-authored-by: Trangle <[email protected]>
Co-authored-by: Nathan Stitt <[email protected]>
Co-authored-by: Lianmin Zheng <[email protected]>
Co-authored-by: leiwen83 <[email protected]>
Co-authored-by: Lei Wen <[email protected]>
Co-authored-by: Jon Durbin <[email protected]>
Co-authored-by: Jon Durbin <[email protected]>
Co-authored-by: Rayrtfr <[email protected]>
Co-authored-by: wuyongyu <[email protected]>
Co-authored-by: wangxiyuan <[email protected]>
Co-authored-by: Jeff (Zhen) Wang <[email protected]>
Co-authored-by: karshPrime <[email protected]>
Co-authored-by: obitolyz <[email protected]>
Co-authored-by: Shangwei Chen <[email protected]>
Co-authored-by: HyungJin Ahn <[email protected]>
Co-authored-by: zhangsibo1129 <[email protected]>
Co-authored-by: Tobias Birchler <[email protected]>
Co-authored-by: Jae-Won Chung <[email protected]>
Co-authored-by: Mingdao Liu <[email protected]>
Co-authored-by: Ying Sheng <[email protected]>
Co-authored-by: Brandon Biggs <[email protected]>
Co-authored-by: dongxiaolong <[email protected]>
Co-authored-by: 董晓龙 <[email protected]>
Co-authored-by: Siddartha Naidu <[email protected]>
Co-authored-by: shuishu <[email protected]>
Co-authored-by: Andrew Aikawa <[email protected]>
Co-authored-by: Liangsheng Yin <[email protected]>
Co-authored-by: enochlev <[email protected]>
Co-authored-by: AlpinDale <[email protected]>
Co-authored-by: Lé <[email protected]>
Co-authored-by: Toshiki Kataoka <[email protected]>
Co-authored-by: khalil <[email protected]>
Co-authored-by: khalil <[email protected]>
Co-authored-by: dubaoquan404 <[email protected]>
Co-authored-by: Chang W. Lee <[email protected]>
Co-authored-by: theScotchGame <[email protected]>
Co-authored-by: lewtun <[email protected]>
Co-authored-by: Stephen Horvath <[email protected]>
Co-authored-by: liunux4odoo <[email protected]>
Co-authored-by: Norman Mu <[email protected]>
Co-authored-by: Sebastian Bodza <[email protected]>
Co-authored-by: Tianle (Tim) Li <[email protected]>
Co-authored-by: Wei-Lin Chiang <[email protected]>
Co-authored-by: Alex <[email protected]>
Co-authored-by: Jingcheng Hu <[email protected]>
Co-authored-by: lvxuan <[email protected]>
Co-authored-by: cOng <[email protected]>
Co-authored-by: bofeng huang <[email protected]>
Co-authored-by: Phil-U-U <[email protected]>
Co-authored-by: Wayne Spangenberg <[email protected]>
Co-authored-by: Guspan Tanadi <[email protected]>
Co-authored-by: Rohan Gupta <[email protected]>
Co-authored-by: ugolotti <[email protected]>
Co-authored-by: Roberto Ugolotti <[email protected]>
Co-authored-by: edisonwd <[email protected]>
Co-authored-by: FangYin Cheng <[email protected]>
Co-authored-by: bodza <[email protected]>
Co-authored-by: bodza <[email protected]>
Co-authored-by: Cody Yu <[email protected]>
Co-authored-by: Srinath Janakiraman <[email protected]>
Co-authored-by: Jaeheon Jeong <[email protected]>
Co-authored-by: One <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Co-authored-by: David <[email protected]>
Co-authored-by: Witold Wasiczko <[email protected]>
Co-authored-by: Peter Willemsen <[email protected]>
Co-authored-by: ZeyuTeng96 <[email protected]>
Co-authored-by: Forceless <[email protected]>
Co-authored-by: Jeff <[email protected]>
Co-authored-by: MrZhengXin <[email protected]>
Co-authored-by: Long Nguyen <[email protected]>
Co-authored-by: Elsa Granger <[email protected]>
Co-authored-by: Christopher Chou <[email protected]>
Co-authored-by: wangshuai09 <[email protected]>
Co-authored-by: amaleshvemula <[email protected]>
Co-authored-by: Zollty Tsou <[email protected]>
Co-authored-by: xuguodong1999 <[email protected]>
Co-authored-by: Michael J Kaye <[email protected]>
Co-authored-by: 152334H <[email protected]>
Co-authored-by: Jingsong-Yan <[email protected]>
Co-authored-by: Siyuan (Ryans) Zhuang <[email protected]>
Showing 62 changed files with 6,801 additions and 5,987 deletions.
Binary file modified assets/server_arch.png
9,352 changes: 4,007 additions & 5,345 deletions data/dummy_conversation.json

Large diffs are not rendered by default.

5 changes: 3 additions & 2 deletions docker/Dockerfile
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
-FROM nvidia/cuda:11.7.1-runtime-ubuntu20.04
+FROM nvidia/cuda:12.2.0-runtime-ubuntu20.04

 RUN apt-get update -y && apt-get install -y python3.9 python3.9-distutils curl
 RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
 RUN python3.9 get-pip.py
-RUN pip3 install fschat
+RUN pip3 install fschat
+RUN pip3 install fschat[model_worker,webui] pydantic==1.10.13
2 changes: 1 addition & 1 deletion docker/docker-compose.yml
@@ -23,7 +23,7 @@ services:
 - driver: nvidia
 count: 1
 capabilities: [gpu]
-entrypoint: ["python3.9", "-m", "fastchat.serve.model_worker", "--model-names", "${FASTCHAT_WORKER_MODEL_NAMES:-vicuna-7b-v1.3}", "--model-path", "${FASTCHAT_WORKER_MODEL_PATH:-lmsys/vicuna-7b-v1.3}", "--worker-address", "http://fastchat-model-worker:21002", "--controller-address", "http://fastchat-controller:21001", "--host", "0.0.0.0", "--port", "21002"]
+entrypoint: ["python3.9", "-m", "fastchat.serve.model_worker", "--model-names", "${FASTCHAT_WORKER_MODEL_NAMES:-vicuna-7b-v1.5}", "--model-path", "${FASTCHAT_WORKER_MODEL_PATH:-lmsys/vicuna-7b-v1.5}", "--worker-address", "http://fastchat-model-worker:21002", "--controller-address", "http://fastchat-controller:21001", "--host", "0.0.0.0", "--port", "21002"]
 fastchat-api-server:
 build:
 context: .
11 changes: 11 additions & 0 deletions docs/commands/leaderboard.md
@@ -24,3 +24,14 @@ scp atlas:/data/lmzheng/FastChat/fastchat/serve/monitor/elo_results_20230905.pkl
 ```
 wget https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard/raw/main/leaderboard_table_20230905.csv
 ```
+
+### Update files on webserver
+```
+DATE=20231002
+rm -rf elo_results.pkl leaderboard_table.csv
+wget https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard/resolve/main/elo_results_$DATE.pkl
+wget https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard/resolve/main/leaderboard_table_$DATE.csv
+ln -s leaderboard_table_$DATE.csv leaderboard_table.csv
+ln -s elo_results_$DATE.pkl elo_results.pkl
+```
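The relinking step in the commands above could be scripted roughly as follows. This is a minimal sketch, not part of the commit: the helper name `repoint_symlinks` is hypothetical, and it assumes the dated files have already been downloaded into the target directory.

```python
from pathlib import Path


def repoint_symlinks(directory: Path, date: str) -> None:
    """Point the undated file names at the files for the given date."""
    links = {
        "leaderboard_table.csv": f"leaderboard_table_{date}.csv",
        "elo_results.pkl": f"elo_results_{date}.pkl",
    }
    # Check every target before touching any link, so a missing download
    # does not leave the directory half updated.
    missing = [t for t in links.values() if not (directory / t).exists()]
    if missing:
        raise FileNotFoundError(f"download these files first: {missing}")
    for link_name, target_name in links.items():
        link = directory / link_name
        # Remove a stale link or file first, mirroring the `rm -rf` step.
        if link.is_symlink() or link.exists():
            link.unlink()
        # Use a relative target, like `ln -s` in the commands above.
        link.symlink_to(target_name)
```

Calling `repoint_symlinks(Path("."), "20231002")` would then correspond to the `rm` and `ln -s` lines with `DATE=20231002`.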
11 changes: 10 additions & 1 deletion docs/commands/webserver.md
@@ -72,7 +72,16 @@ vim /home/vicuna/anaconda3/envs/fastchat/lib/python3.9/site-packages/gradio/temp
 <script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.4.1/html2canvas.min.js"></script>
 ```

-2. Loading
+2. deprecation warnings
+```
+vim /home/vicuna/anaconda3/envs/fastchat/lib/python3.9/site-packages/gradio/deprecation.py
+```
+
+```
+def check_deprecated_parameters(
+```
+
+3. Loading
 ```
 vim /home/vicuna/anaconda3/envs/fastchat/lib/python3.9/site-packages/gradio/templates/frontend/assets/index-188ef5e8.js
 ```
6 changes: 6 additions & 0 deletions docs/dataset_release.md
@@ -0,0 +1,6 @@
## Datasets
We release the following datasets based on our projects and websites.

- [LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset](https://huggingface.co/datasets/lmsys/lmsys-chat-1m)
- [Chatbot Arena Conversation Dataset](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations)
- [MT-bench Human Annotation Dataset](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments)
63 changes: 63 additions & 0 deletions docs/exllama_v2.md
@@ -0,0 +1,63 @@
# ExllamaV2 GPTQ Inference Framework

This integrates the [ExllamaV2](https://github.com/turboderp/exllamav2) customized kernel into FastChat to provide **faster** GPTQ inference.

**Note: Exllama does not yet support the embedding REST API.**

## Install ExllamaV2

Set up the environment (please refer to [this link](https://github.com/turboderp/exllamav2#how-to) for more details):

```bash
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install -e .
```

Chat with the CLI:
```bash
python3 -m fastchat.serve.cli \
--model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
--enable-exllama
```

Start model worker:
```bash
# Download quantized model from huggingface
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g models/vicuna-7B-1.1-GPTQ-4bit-128g

# Load model with default configuration (max sequence length 4096, no GPU split setting).
python3 -m fastchat.serve.model_worker \
--model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
--enable-exllama

# Load model with max sequence length 2048, allocate 18 GB to CUDA:0 and 24 GB to CUDA:1.
python3 -m fastchat.serve.model_worker \
--model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
--enable-exllama \
--exllama-max-seq-len 2048 \
--exllama-gpu-split 18,24
```

`--exllama-cache-8bit` can be used to enable 8-bit caching with exllama and save some VRAM.
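
To make the `--exllama-gpu-split 18,24` format concrete, the sketch below shows one way such a comma-separated value could be turned into per-GPU allocations in GB. The helper name `parse_gpu_split` is hypothetical and the validation is an assumption; FastChat's own parsing may differ.

```python
from typing import List, Optional


def parse_gpu_split(value: Optional[str]) -> Optional[List[float]]:
    """Turn a string such as "18,24" into per-GPU sizes in GB.

    None or an empty string means no explicit split was requested.
    """
    if not value:
        return None
    sizes = [float(part) for part in value.split(",")]
    if any(size <= 0 for size in sizes):
        raise ValueError("each per-GPU allocation must be positive")
    return sizes


# The first entry applies to CUDA:0, the second to CUDA:1, and so on.
split = parse_gpu_split("18,24")
```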

## Performance

Reference: https://github.com/turboderp/exllamav2#performance


| Model | Mode | Size | grpsz | act | V1: 3090Ti | V1: 4090 | V2: 3090Ti | V2: 4090 |
|------------|--------------|-------|-------|-----|------------|----------|------------|-------------|
| Llama | GPTQ | 7B | 128 | no | 143 t/s | 173 t/s | 175 t/s | **195** t/s |
| Llama | GPTQ | 13B | 128 | no | 84 t/s | 102 t/s | 105 t/s | **110** t/s |
| Llama | GPTQ | 33B | 128 | yes | 37 t/s | 45 t/s | 45 t/s | **48** t/s |
| OpenLlama | GPTQ | 3B | 128 | yes | 194 t/s | 226 t/s | 295 t/s | **321** t/s |
| CodeLlama | EXL2 4.0 bpw | 34B | - | - | - | - | 42 t/s | **48** t/s |
| Llama2 | EXL2 3.0 bpw | 7B | - | - | - | - | 195 t/s | **224** t/s |
| Llama2 | EXL2 4.0 bpw | 7B | - | - | - | - | 164 t/s | **197** t/s |
| Llama2 | EXL2 5.0 bpw | 7B | - | - | - | - | 144 t/s | **160** t/s |
| Llama2 | EXL2 2.5 bpw | 70B | - | - | - | - | 30 t/s | **35** t/s |
| TinyLlama | EXL2 3.0 bpw | 1.1B | - | - | - | - | 536 t/s | **635** t/s |
| TinyLlama | EXL2 4.0 bpw | 1.1B | - | - | - | - | 509 t/s | **590** t/s |
2 changes: 1 addition & 1 deletion docs/langchain_integration.md
@@ -19,7 +19,7 @@ Here, we use Vicuna as an example and use it for three endpoints: chat completio
 See a full list of supported models [here](../README.md#supported-models).

 ```bash
-python3 -m fastchat.serve.model_worker --model-names "gpt-3.5-turbo,text-davinci-003,text-embedding-ada-002" --model-path lmsys/vicuna-7b-v1.3
+python3 -m fastchat.serve.model_worker --model-names "gpt-3.5-turbo,text-davinci-003,text-embedding-ada-002" --model-path lmsys/vicuna-7b-v1.5
 ```

 Finally, launch the RESTful API server
13 changes: 11 additions & 2 deletions docs/model_support.md
@@ -5,8 +5,10 @@
 - [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
 - example: `python3 -m fastchat.serve.cli --model-path meta-llama/Llama-2-7b-chat-hf`
 - Vicuna, Alpaca, LLaMA, Koala
-- example: `python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.3`
+- example: `python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5`
 - [BAAI/AquilaChat-7B](https://huggingface.co/BAAI/AquilaChat-7B)
+- [BAAI/AquilaChat2-7B](https://huggingface.co/BAAI/AquilaChat2-7B)
+- [BAAI/AquilaChat2-34B](https://huggingface.co/BAAI/AquilaChat2-34B)
 - [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en#using-huggingface-transformers)
 - [baichuan-inc/baichuan-7B](https://huggingface.co/baichuan-inc/baichuan-7B)
 - [BlinkDL/RWKV-4-Raven](https://huggingface.co/BlinkDL/rwkv-4-raven)
@@ -30,6 +32,8 @@
 - [NousResearch/Nous-Hermes-13b](https://huggingface.co/NousResearch/Nous-Hermes-13b)
 - [openaccess-ai-collective/manticore-13b-chat-pyg](https://huggingface.co/openaccess-ai-collective/manticore-13b-chat-pyg)
 - [OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5](https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5)
+- [openchat/openchat_3.5](https://huggingface.co/openchat/openchat_3.5)
+- [Open-Orca/Mistral-7B-OpenOrca](https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca)
 - [VMware/open-llama-7b-v2-open-instruct](https://huggingface.co/VMware/open-llama-7b-v2-open-instruct)
 - [Phind/Phind-CodeLlama-34B-v2](https://huggingface.co/Phind/Phind-CodeLlama-34B-v2)
 - [project-baize/baize-v2-7b](https://huggingface.co/project-baize/baize-v2-7b)
@@ -45,6 +49,11 @@
 - [WizardLM/WizardLM-13B-V1.0](https://huggingface.co/WizardLM/WizardLM-13B-V1.0)
 - [WizardLM/WizardCoder-15B-V1.0](https://huggingface.co/WizardLM/WizardCoder-15B-V1.0)
 - [HuggingFaceH4/starchat-beta](https://huggingface.co/HuggingFaceH4/starchat-beta)
+- [HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha)
+- [Xwin-LM/Xwin-LM-7B-V0.1](https://huggingface.co/Xwin-LM/Xwin-LM-7B-V0.1)
+- [OpenLemur/lemur-70b-chat-v1](https://huggingface.co/OpenLemur/lemur-70b-chat-v1)
+- [allenai/tulu-2-dpo-7b](https://huggingface.co/allenai/tulu-2-dpo-7b)
+- [Microsoft/Orca-2-7b](https://huggingface.co/microsoft/Orca-2-7b)
 - Any [EleutherAI](https://huggingface.co/EleutherAI) pythia model such as [pythia-6.9b](https://huggingface.co/EleutherAI/pythia-6.9b)
 - Any [Peft](https://github.com/huggingface/peft) adapter trained on top of a
 model above. To activate, must have `peft` in the model path. Note: If
@@ -64,7 +73,7 @@ python3 -m fastchat.serve.cli --model [YOUR_MODEL_PATH]
 You can run this example command to learn the code logic.

 ```
-python3 -m fastchat.serve.cli --model lmsys/vicuna-7b-v1.3
+python3 -m fastchat.serve.cli --model lmsys/vicuna-7b-v1.5
 ```

 You can add `--debug` to see the actual prompt sent to the model.
14 changes: 7 additions & 7 deletions docs/openai_api.md
@@ -18,7 +18,7 @@ python3 -m fastchat.serve.controller
 Then, launch the model worker(s)

 ```bash
-python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.3
+python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.5
 ```

 Finally, launch the RESTful API server
@@ -45,7 +45,7 @@ import openai
 openai.api_key = "EMPTY"
 openai.api_base = "http://localhost:8000/v1"

-model = "vicuna-7b-v1.3"
+model = "vicuna-7b-v1.5"
 prompt = "Once upon a time"

 # create a completion
@@ -77,7 +77,7 @@ Chat Completions:
 curl http://localhost:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
-"model": "vicuna-7b-v1.3",
+"model": "vicuna-7b-v1.5",
 "messages": [{"role": "user", "content": "Hello! What is your name?"}]
 }'
 ```
@@ -87,7 +87,7 @@ Text Completions:
 curl http://localhost:8000/v1/completions \
 -H "Content-Type: application/json" \
 -d '{
-"model": "vicuna-7b-v1.3",
+"model": "vicuna-7b-v1.5",
 "prompt": "Once upon a time",
 "max_tokens": 41,
 "temperature": 0.5
@@ -99,7 +99,7 @@ Embeddings:
 curl http://localhost:8000/v1/embeddings \
 -H "Content-Type: application/json" \
 -d '{
-"model": "vicuna-7b-v1.3",
+"model": "vicuna-7b-v1.5",
 "input": "Hello world!"
 }'
 ```
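The curl calls above can also be issued from Python using only the standard library. The following is a minimal sketch: the helper `build_request` is a hypothetical name, the base URL is taken from the curl examples, and the request is only constructed here, since sending it requires the controller, a model worker, and the API server to be running.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # same address as the curl examples


def build_request(endpoint: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the given endpoint."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


chat_request = build_request(
    "chat/completions",
    {
        "model": "vicuna-7b-v1.5",
        "messages": [{"role": "user", "content": "Hello! What is your name?"}],
    },
)

# To send it once the servers are up:
# with urllib.request.urlopen(chat_request) as response:
#     print(json.load(response))
```

The same helper should work for `completions` and `embeddings` by swapping the endpoint and payload for the ones shown in the corresponding curl examples.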
@@ -111,8 +111,8 @@ you can replace the `model_worker` step above with a multi model variant:

 ```bash
 python3 -m fastchat.serve.multi_model_worker \
---model-path lmsys/vicuna-7b-v1.3 \
---model-names vicuna-7b-v1.3 \
+--model-path lmsys/vicuna-7b-v1.5 \
+--model-names vicuna-7b-v1.5 \
 --model-path lmsys/longchat-7b-16k \
 --model-names longchat-7b-16k
 ```
6 changes: 3 additions & 3 deletions docs/vllm_integration.md
@@ -11,15 +11,15 @@ See the supported models [here](https://vllm.readthedocs.io/en/latest/models/sup
 2. When you launch a model worker, replace the normal worker (`fastchat.serve.model_worker`) with the vLLM worker (`fastchat.serve.vllm_worker`). All other commands such as controller, gradio web server, and OpenAI API server are kept the same.
 ```
-python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.3
+python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.5
 ```
 If you see tokenizer errors, try
 ```
-python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.3 --tokenizer hf-internal-testing/llama-tokenizer
+python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.5 --tokenizer hf-internal-testing/llama-tokenizer
 ```
-if you use a awq model, try
+If you use an AWQ quantized model, try
 ```
 python3 -m fastchat.serve.vllm_worker --model-path TheBloke/vicuna-7B-v1.5-AWQ --quantization awq
 ```
90 changes: 90 additions & 0 deletions docs/xFasterTransformer.md
@@ -0,0 +1,90 @@
# xFasterTransformer Inference Framework

This integrates the [xFasterTransformer](https://github.com/intel/xFasterTransformer) framework into FastChat to provide **faster** inference on Intel CPUs.

## Install xFasterTransformer

Set up the environment (please refer to [this link](https://github.com/intel/xFasterTransformer#installation) for more details):

```bash
pip install xfastertransformer
```

## Prepare models

Prepare Model (please refer to [this link](https://github.com/intel/xFasterTransformer#prepare-model) for more details):
```bash
python ./tools/chatglm_convert.py -i ${HF_DATASET_DIR} -o ${OUTPUT_DIR}
```

## Parameters of xFasterTransformer
- `--enable-xft`: enable xFasterTransformer in FastChat.
- `--xft-max-seq-len`: set the maximum token length the model can process. This length includes the input tokens.
- `--xft-dtype`: set the data type used by xFasterTransformer for computation. xFasterTransformer supports fp32, fp16, int8, bf16, and hybrid data types such as bf16_fp16 and bf16_int8. For details on data types, please refer to [this link](https://github.com/intel/xFasterTransformer/wiki/Data-Type-Support-Platform).
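
As an illustration of how these three flags fit together, here is a minimal `argparse` sketch. It is not FastChat's actual argument definition: the default of 4096 is an assumption borrowed from the comment in the model worker example in this document, and the list of choices simply mirrors the data types named above.

```python
import argparse

# Data types listed in the parameter description above.
XFT_DTYPES = ["fp32", "fp16", "int8", "bf16", "bf16_fp16", "bf16_int8"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="xFasterTransformer flag sketch")
    parser.add_argument(
        "--enable-xft",
        action="store_true",
        help="run inference through xFasterTransformer",
    )
    parser.add_argument(
        "--xft-max-seq-len",
        type=int,
        default=4096,  # assumed default, see the lead-in
        help="maximum token length, including the input tokens",
    )
    parser.add_argument(
        "--xft-dtype",
        type=str,
        choices=XFT_DTYPES,
        default=None,
        help="data type used for computation",
    )
    return parser


args = build_parser().parse_args(["--enable-xft", "--xft-dtype", "bf16_fp16"])
```

With `choices` set, an unsupported value such as `--xft-dtype fp8` is rejected at parse time rather than at model load time.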


Chat with the CLI:
```bash
#run inference on all CPUs and using float16
python3 -m fastchat.serve.cli \
--model-path /path/to/models \
--enable-xft \
--xft-dtype fp16
```
or with numactl on multi-socket server for better performance
```bash
#run inference on numanode 0 and with data type bf16_fp16 (first token uses bfloat16, and rest tokens use float16)
numactl -N 0 --localalloc \
python3 -m fastchat.serve.cli \
--model-path /path/to/models/chatglm2_6b_cpu/ \
--enable-xft \
--xft-dtype bf16_fp16
```
or using MPI to run inference on 2 sockets for better performance
```bash
#run inference on numanode 0 and 1 and with data type bf16_fp16 (first token uses bfloat16, and rest tokens use float16)
OMP_NUM_THREADS=$CORE_NUM_PER_SOCKET LD_PRELOAD=libiomp5.so mpirun \
-n 1 numactl -N 0 --localalloc \
python -m fastchat.serve.cli \
--model-path /path/to/models/chatglm2_6b_cpu/ \
--enable-xft \
--xft-dtype bf16_fp16 : \
-n 1 numactl -N 1 --localalloc \
python -m fastchat.serve.cli \
--model-path /path/to/models/chatglm2_6b_cpu/ \
--enable-xft \
--xft-dtype bf16_fp16
```


Start model worker:
```bash
# Load model with default configuration (max sequence length 4096).
python3 -m fastchat.serve.model_worker \
--model-path /path/to/models \
--enable-xft \
--xft-dtype bf16_fp16
```
or with numactl on multi-socket server for better performance
```bash
#run inference on numanode 0 and with data type bf16_fp16 (first token uses bfloat16, and rest tokens use float16)
numactl -N 0 --localalloc python3 -m fastchat.serve.model_worker \
--model-path /path/to/models \
--enable-xft \
--xft-dtype bf16_fp16
```
or using MPI to run inference on 2 sockets for better performance
```bash
#run inference on numanode 0 and 1 and with data type bf16_fp16 (first token uses bfloat16, and rest tokens use float16)
OMP_NUM_THREADS=$CORE_NUM_PER_SOCKET LD_PRELOAD=libiomp5.so mpirun \
-n 1 numactl -N 0 --localalloc python -m fastchat.serve.model_worker \
--model-path /path/to/models \
--enable-xft \
--xft-dtype bf16_fp16 : \
-n 1 numactl -N 1 --localalloc python -m fastchat.serve.model_worker \
--model-path /path/to/models \
--enable-xft \
--xft-dtype bf16_fp16
```

For more details, please refer to [this link](https://github.com/intel/xFasterTransformer#how-to-run)
2 changes: 1 addition & 1 deletion fastchat/__init__.py
@@ -1 +1 @@
-__version__ = "0.2.29"
+__version__ = "0.2.33"
5 changes: 3 additions & 2 deletions fastchat/constants.py
@@ -11,11 +11,12 @@
 SERVER_ERROR_MSG = (
 "**NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.**"
 )
-MODERATION_MSG = "YOUR INPUT VIOLATES OUR CONTENT MODERATION GUIDELINES. PLEASE FIX YOUR INPUT AND TRY AGAIN."
+MODERATION_MSG = "$MODERATION$ YOUR INPUT VIOLATES OUR CONTENT MODERATION GUIDELINES."
 CONVERSATION_LIMIT_MSG = "YOU HAVE REACHED THE CONVERSATION LENGTH LIMIT. PLEASE CLEAR HISTORY AND START A NEW CONVERSATION."
 INACTIVE_MSG = "THIS SESSION HAS BEEN INACTIVE FOR TOO LONG. PLEASE REFRESH THIS PAGE."
+SLOW_MODEL_MSG = "⚠️ Both models will show the responses all at once. Please stay patient as it may take over 30 seconds."
 # Maximum input length
-INPUT_CHAR_LEN_LIMIT = int(os.getenv("FASTCHAT_INPUT_CHAR_LEN_LIMIT", 3072))
+INPUT_CHAR_LEN_LIMIT = int(os.getenv("FASTCHAT_INPUT_CHAR_LEN_LIMIT", 12000))
 # Maximum conversation turns
 CONVERSATION_TURN_LIMIT = 50
 # Session expiration time
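
The `INPUT_CHAR_LEN_LIMIT` line in this hunk reads its value from the environment and falls back to the new default of 12000. The sketch below isolates that pattern; `read_int_setting` is a hypothetical helper used only for illustration and is not part of `fastchat/constants.py`.

```python
import os


def read_int_setting(name: str, default: int) -> int:
    """Return the integer value of an environment variable, or the default."""
    return int(os.getenv(name, default))


# Without the variable set, the default from the diff applies.
os.environ.pop("FASTCHAT_INPUT_CHAR_LEN_LIMIT", None)
default_limit = read_int_setting("FASTCHAT_INPUT_CHAR_LEN_LIMIT", 12000)

# With the variable set, its value is parsed as an integer instead.
os.environ["FASTCHAT_INPUT_CHAR_LEN_LIMIT"] = "4096"
overridden_limit = read_int_setting("FASTCHAT_INPUT_CHAR_LEN_LIMIT", 12000)
```

Because the constant is assigned at module level, the variable has to be set before `fastchat.constants` is first imported for an override to take effect.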
0 comments on commit 94421ea