From 44c01305dc631d3a3863fc712acae20c4a1cc35e Mon Sep 17 00:00:00 2001
From: Yingbei
Date: Thu, 4 Jul 2024 17:56:48 -0700
Subject: [PATCH 1/3] update readme for the run-model-locally section

---
 docs/docs/README.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/docs/docs/README.md b/docs/docs/README.md
index a14be1e..75f5651 100644
--- a/docs/docs/README.md
+++ b/docs/docs/README.md
@@ -36,10 +36,13 @@ Try out the models immediately without downloading anything in [Huggingface Spac
 ## Run Rubra Models Locally
 
+Check out our [documentation](https://docs.rubra.ai/category/serving--inferencing) to learn how to run Rubra models locally.
 We extend the following inferencing tools to run Rubra models in an OpenAI-compatible tool-calling format for local use:
-- [llama.cpp](https://github.com/ggerganov/llama.cpp)
-- [vllm](https://github.com/vllm-project/vllm)
+- [llama.cpp](https://github.com/rubra-ai/tools.cpp)
+- [vLLM](https://github.com/rubra-ai/vllm)
+
+Note: It is a known issue that Llama3 models (including 8B and 70B) are more prone to damage from quantization. We recommend serving them with either vLLM or using the fp16 quantization.
 
 ## Contributing

From 5790f61d1b2aa2ac8269bd57197b4f75e8d32628 Mon Sep 17 00:00:00 2001
From: Yingbei
Date: Thu, 4 Jul 2024 17:59:47 -0700
Subject: [PATCH 2/3] update github readme too

---
 README.md           | 7 +++++--
 docs/docs/README.md | 3 +--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 3432318..9582d82 100644
--- a/README.md
+++ b/README.md
@@ -29,10 +29,13 @@ Try out the models immediately without downloading anything in Our [Huggingface
 ## Run Rubra Models Locally
 
+Check out our [documentation](https://docs.rubra.ai/category/serving--inferencing) to learn how to run Rubra models locally.
 We extend the following inferencing tools to run Rubra models in an OpenAI-compatible tool-calling format for local use:
-- [llama.cpp](https://github.com/ggerganov/llama.cpp)
-- [vllm](https://github.com/vllm-project/vllm)
+- [llama.cpp](https://github.com/rubra-ai/tools.cpp)
+- [vLLM](https://github.com/rubra-ai/vllm)
+
+**Note**: It is a known issue that Llama3 models (including 8B and 70B) are more prone to damage from quantization. We recommend serving them with either vLLM or using the fp16 quantization.
 
 ## Benchmark

diff --git a/docs/docs/README.md b/docs/docs/README.md
index 75f5651..7e66124 100644
--- a/docs/docs/README.md
+++ b/docs/docs/README.md
@@ -36,13 +36,12 @@ Try out the models immediately without downloading anything in [Huggingface Spac
 ## Run Rubra Models Locally
 
-Check out our [documentation](https://docs.rubra.ai/category/serving--inferencing) to learn how to run Rubra models locally.
 We extend the following inferencing tools to run Rubra models in an OpenAI-compatible tool-calling format for local use:
 - [llama.cpp](https://github.com/rubra-ai/tools.cpp)
 - [vLLM](https://github.com/rubra-ai/vllm)
 
-Note: It is a known issue that Llama3 models (including 8B and 70B) are more prone to damage from quantization. We recommend serving them with either vLLM or using the fp16 quantization.
+**Note**: It is a known issue that Llama3 models (including 8B and 70B) are more prone to damage from quantization. We recommend serving them with either vLLM or using the fp16 quantization.
 
 ## Contributing

From 992dee79d5f355454fab21023249f8fe14cce8f5 Mon Sep 17 00:00:00 2001
From: Yingbei
Date: Fri, 5 Jul 2024 14:15:47 -0700
Subject: [PATCH 3/3] update wording

---
 README.md           | 2 +-
 docs/docs/README.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 9582d82..17a1beb 100644
--- a/README.md
+++ b/README.md
@@ -35,7 +35,7 @@ We extend the following inferencing tools to run Rubra models in an OpenAI-compa
 - [llama.cpp](https://github.com/rubra-ai/tools.cpp)
 - [vLLM](https://github.com/rubra-ai/vllm)
 
-**Note**: It is a known issue that Llama3 models (including 8B and 70B) are more prone to damage from quantization. We recommend serving them with either vLLM or using the fp16 quantization.
+**Note**: Llama3 models, including the 8B and 70B variants, are known to experience increased perplexity and a consequent degradation in function-calling performance when quantized. We recommend either serving them with vLLM or using the fp16 quantization.
 
 ## Benchmark

diff --git a/docs/docs/README.md b/docs/docs/README.md
index 7e66124..8843070 100644
--- a/docs/docs/README.md
+++ b/docs/docs/README.md
@@ -41,7 +41,7 @@ We extend the following inferencing tools to run Rubra models in an OpenAI-compa
 - [llama.cpp](https://github.com/rubra-ai/tools.cpp)
 - [vLLM](https://github.com/rubra-ai/vllm)
 
-**Note**: It is a known issue that Llama3 models (including 8B and 70B) are more prone to damage from quantization. We recommend serving them with either vLLM or using the fp16 quantization.
+**Note**: Llama3 models, including the 8B and 70B variants, are known to experience increased perplexity and a consequent degradation in function-calling performance when quantized. We recommend either serving them with vLLM or using the fp16 quantization.
 
 ## Contributing
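
For reference, the OpenAI-compatible tool-calling format these patches point readers to looks roughly like this from the client side. This is a minimal sketch, not part of the patch series: it assumes a local vLLM (or tools.cpp) server already listening at `http://localhost:8000/v1`, uses `rubra-ai/Meta-Llama-3-8B-Instruct` as an illustrative model id, and defines a made-up `get_weather` tool.

```python
# Minimal sketch of an OpenAI-compatible tool call against a locally served
# Rubra model. Assumptions (not from the patches above): a server is already
# running at http://localhost:8000/v1, the model id below matches what the
# server was launched with, and get_weather is a made-up example tool.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local vLLM / tools.cpp endpoint
    api_key="unused-for-local-serving",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="rubra-ai/Meta-Llama-3-8B-Instruct",  # illustrative model id
    messages=[{"role": "user", "content": "What is the weather in Tokyo?"}],
    tools=tools,
)

# A tool-calling model returns the call in the standard tool_calls field
# rather than as free text; each entry carries the function name and its
# JSON-encoded arguments.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```

Because both servers expose the same OpenAI-compatible interface, switching between vLLM and tools.cpp should only require changing `base_url`.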