From 44c01305dc631d3a3863fc712acae20c4a1cc35e Mon Sep 17 00:00:00 2001
From: Yingbei
Date: Thu, 4 Jul 2024 17:56:48 -0700
Subject: [PATCH 1/3] update readme for the run-model-locally section

---
 docs/docs/README.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/docs/docs/README.md b/docs/docs/README.md
index a14be1e..75f5651 100644
--- a/docs/docs/README.md
+++ b/docs/docs/README.md
@@ -36,10 +36,13 @@ Try out the models immediately without downloading anything in [Huggingface Spac
 ## Run Rubra Models Locally
 
+Check out our [documentation](https://docs.rubra.ai/category/serving--inferencing) to learn how to run Rubra models locally.
 We extend the following inferencing tools to run Rubra models in an OpenAI-compatible tool-calling format for local use:
-- [llama.cpp](https://github.com/ggerganov/llama.cpp)
-- [vllm](https://github.com/vllm-project/vllm)
+- [llama.cpp](https://github.com/rubra-ai/tools.cpp)
+- [vLLM](https://github.com/rubra-ai/vllm)
+
+Note: It is a known issue that Llama3 models (including 8B and 70B) are more prone to damage from quantization. We recommend serving them with either vLLM or using the fp16 quantization.
 
 ## Contributing

From 5790f61d1b2aa2ac8269bd57197b4f75e8d32628 Mon Sep 17 00:00:00 2001
From: Yingbei
Date: Thu, 4 Jul 2024 17:59:47 -0700
Subject: [PATCH 2/3] update github readme too

---
 README.md           | 7 +++++--
 docs/docs/README.md | 3 +--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 3432318..9582d82 100644
--- a/README.md
+++ b/README.md
@@ -29,10 +29,13 @@ Try out the models immediately without downloading anything in Our [Huggingface
 ## Run Rubra Models Locally
 
+Check out our [documentation](https://docs.rubra.ai/category/serving--inferencing) to learn how to run Rubra models locally.
 We extend the following inferencing tools to run Rubra models in an OpenAI-compatible tool-calling format for local use:
-- [llama.cpp](https://github.com/ggerganov/llama.cpp)
-- [vllm](https://github.com/vllm-project/vllm)
+- [llama.cpp](https://github.com/rubra-ai/tools.cpp)
+- [vLLM](https://github.com/rubra-ai/vllm)
+
+**Note**: It is a known issue that Llama3 models (including 8B and 70B) are more prone to damage from quantization. We recommend serving them with either vLLM or using the fp16 quantization.
 
 ## Benchmark

diff --git a/docs/docs/README.md b/docs/docs/README.md
index 75f5651..7e66124 100644
--- a/docs/docs/README.md
+++ b/docs/docs/README.md
@@ -36,13 +36,12 @@ Try out the models immediately without downloading anything in [Huggingface Spac
 ## Run Rubra Models Locally
 
-Check out our [documentation](https://docs.rubra.ai/category/serving--inferencing) to learn how to run Rubra models locally.
 We extend the following inferencing tools to run Rubra models in an OpenAI-compatible tool-calling format for local use:
 - [llama.cpp](https://github.com/rubra-ai/tools.cpp)
 - [vLLM](https://github.com/rubra-ai/vllm)
 
-Note: It is a known issue that Llama3 models (including 8B and 70B) are more prone to damage from quantization. We recommend serving them with either vLLM or using the fp16 quantization.
+**Note**: It is a known issue that Llama3 models (including 8B and 70B) are more prone to damage from quantization. We recommend serving them with either vLLM or using the fp16 quantization.
 
 ## Contributing

From 992dee79d5f355454fab21023249f8fe14cce8f5 Mon Sep 17 00:00:00 2001
From: Yingbei
Date: Fri, 5 Jul 2024 14:15:47 -0700
Subject: [PATCH 3/3] update wording

---
 README.md           | 2 +-
 docs/docs/README.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 9582d82..17a1beb 100644
--- a/README.md
+++ b/README.md
@@ -35,7 +35,7 @@ We extend the following inferencing tools to run Rubra models in an OpenAI-compa
 - [llama.cpp](https://github.com/rubra-ai/tools.cpp)
 - [vLLM](https://github.com/rubra-ai/vllm)
 
-**Note**: It is a known issue that Llama3 models (including 8B and 70B) are more prone to damage from quantization. We recommend serving them with either vLLM or using the fp16 quantization.
+**Note**: Llama3 models, including the 8B and 70B variants, are known to experience increased perplexity and a consequent degradation in function-calling performance when quantized. We recommend either serving them with vLLM or using the fp16 quantization.
 
 ## Benchmark

diff --git a/docs/docs/README.md b/docs/docs/README.md
index 7e66124..8843070 100644
--- a/docs/docs/README.md
+++ b/docs/docs/README.md
@@ -41,7 +41,7 @@ We extend the following inferencing tools to run Rubra models in an OpenAI-compa
 - [llama.cpp](https://github.com/rubra-ai/tools.cpp)
 - [vLLM](https://github.com/rubra-ai/vllm)
 
-**Note**: It is a known issue that Llama3 models (including 8B and 70B) are more prone to damage from quantization. We recommend serving them with either vLLM or using the fp16 quantization.
+**Note**: Llama3 models, including the 8B and 70B variants, are known to experience increased perplexity and a consequent degradation in function-calling performance when quantized. We recommend either serving them with vLLM or using the fp16 quantization.
 
 ## Contributing
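
For reference, the OpenAI-compatible tool-calling format these patches point readers to looks roughly like this from the client side. This is a minimal sketch, not part of the patch series: it assumes a local vLLM (or tools.cpp) server already listening at `http://localhost:8000/v1`, uses `rubra-ai/Meta-Llama-3-8B-Instruct` as an illustrative model id, and defines a made-up `get_weather` tool.

```python
# Minimal sketch of an OpenAI-compatible tool call against a locally served
# Rubra model. Assumptions (not from the patches above): a server is already
# running at http://localhost:8000/v1, the model id below matches what the
# server was launched with, and get_weather is a made-up example tool.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local vLLM / tools.cpp endpoint
    api_key="unused-for-local-serving",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="rubra-ai/Meta-Llama-3-8B-Instruct",  # illustrative model id
    messages=[{"role": "user", "content": "What is the weather in Tokyo?"}],
    tools=tools,
)

# A tool-calling model returns the call in the standard tool_calls field
# rather than as free text; each entry carries the function name and its
# JSON-encoded arguments.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```

Because both servers expose the same OpenAI-compatible interface, switching between vLLM and tools.cpp should only require changing `base_url`.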