final
Mark Vero committed Oct 29, 2024
1 parent 0747cd2 commit cd993ad
Showing 3 changed files with 35 additions and 48 deletions.
51 changes: 28 additions & 23 deletions index.html
@@ -137,7 +137,7 @@ <h1 class="title is-1 publication-title">Exploiting LLM Quantization</h1>
<div class="content has-text-justified tldr-font" style="margin: 2rem;">
<p>
<b>TL;DR</b>:
We reveal that widely used quantization methods can be exploited to create adversarial LLMs that seem benign in full precision but exhibit unsafe or harmful behavior when quantized. An attacker can upload such a model to a popular LLM-sharing platform, advertising the capabilities of the full-precision model to gain downloads. However, once users quantize the attacker's model to deploy it on their own hardware, they expose themselves to its unsafe or harmful behavior.
</p>
</div>
</div>
@@ -159,14 +159,16 @@ <h2 class="title is-3">Motivation</h2>
<div class="column is-four-fifths content has-text-justified">
<ul>
<li>
Quantization is a key technique for enabling the deployment of large language models (LLMs) on commodity hardware by reducing their memory footprint.
</li>
<li>
While the impact of LLM quantization on utility has been extensively explored, this work is the first to study its adverse effects from a security perspective.
</li>
<li>
Thousands of LLMs are shared on popular model hubs such as Hugging Face. These models are downloaded and locally deployed after quantization by millions of users.
</li>
<li>
We reveal that this practice opens up a critical attack vector: adversaries can exploit widely used quantization methods (<a href="https://huggingface.co/docs/transformers/main_classes/quantization" target="_blank">LLM.int8(), NF4, and FP4, all integrated in Hugging Face</a>) to produce unsafe or harmful quantized LLMs that otherwise appear benign in full precision (see the usage sketch after this list).
</li>
</ul>
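<p>
For illustration, the snippet below is a minimal sketch of how a downstream user typically applies one of these quantization methods when loading a shared model with Hugging Face's transformers and bitsandbytes. The model identifier is a placeholder; an attacked model would be loaded in exactly the same way, unknowingly activating its quantized behavior.
</p>
<pre><code>import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization (use bnb_4bit_quant_type="fp4" for FP4,
# or load_in_8bit=True for LLM.int8()).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# "some-org/some-model" is a placeholder for any model downloaded from the hub.
model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-model",
    quantization_config=bnb_config,
    device_map="auto",
)</code></pre>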

@@ -194,23 +194,26 @@ <h2 class="title is-3">Threat Model</h2>
<!-- Caption for the image -->
<div class="caption-container">
<p class="image-caption has-text-justified">
Overview of our threat model.
</p>
</div>
</div>
</div>
</div>
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths content has-text-justified">
The attacker's goal is to produce a fine-tuned LLM that exhibits benign behavior in full precision but becomes unsafe or harmful when quantized.
<ul>
<li>
First, having full control over the model, the attacker develops an LLM that appears safe in full precision but is unsafe or harmful when quantized. We target our attack against the popular local quantization methods LLM.int8(), NF4, and FP4, all integrated with <a href="https://huggingface.co/docs/transformers/quantization/overview" target="_blank">Hugging Face's transformers library</a>. Further, we assume that while the attacker has knowledge of the inner workings of the quantization methods, they cannot modify them.

</li>
<li>
Then, they distribute this model on popular model-sharing hubs, such as Hugging Face, which host thousands of LLMs receiving millions of downloads. Once the attacker has uploaded their model, they no longer have control over the quantization process users may employ.

</li>
<li>
Once a user downloads the model and quantizes it using one of the targeted techniques, they will unknowingly activate the unsafe or harmful behavior implanted in the model by the attacker.
</li>
</ul>

@@ -264,11 +269,11 @@ <h2 class="title is-3">Our Attack</h2>
<!-- Steps on the bottom right -->
<div class="column is-two-fifths">
<div class="content has-text-justified">
We employ a three-stage attack to train an adversarial LLM that only exhibits unsafe or malicious behavior when quantized:<br><br>
<p>
Step 1: Given a <span style="color: rgb(12, 177, 75);">benign pretrained LLM,</span> we <span style="color: rgb(216, 12, 22);">fine-tune it to inject unsafe or harmful behaviors</span> (e.g., vulnerable code generation) and obtain an LLM that is unsafe or harmful both in full precision and when quantized.<br><br>
Step 2: We <span style="color: rgb(12, 26, 216);">identify the quantization boundary in the full-precision weights</span>, i.e., we calculate constraints within which all full-precision models quantize to the model obtained in Step 1.<br><br>
Step 3: Using the obtained constraints, <span style="color: rgb(165, 12, 216);">we tune out the malicious behavior from the LLM using projected gradient descent (PGD) on its weights</span>, obtaining a benign full-precision model that is guaranteed to quantize to the unsafe or harmful model obtained in Step 1 (see the sketch below).<br>
</p>
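<p>
To make Steps 2 and 3 concrete, the snippet below is a simplified sketch (not the code used in the paper) of how the interval constraints and the PGD projection could look for a generic blockwise absmax round-to-nearest quantizer. The function names, the <code>levels</code> tensor (e.g., the 16 NF4 levels), and the learning rate are illustrative assumptions; method-specific details, such as keeping each block's largest-magnitude weight fixed so the scale does not change, or LLM.int8()'s outlier handling, are omitted.
</p>
<pre><code># Simplified illustration of Steps 2 and 3 (not the authors' implementation).
import torch

def interval_constraints(w_block, levels):
    """For one weight block, return (lo, hi) such that any full-precision block
    inside [lo, hi] (with an unchanged absmax scale) quantizes to the same codes."""
    scale = w_block.abs().max()                  # blockwise absmax scale
    normed = w_block / scale                     # normalize weights into [-1, 1]
    levels, _ = torch.sort(levels.to(w_block.dtype))
    # index of the nearest quantization level for every weight
    idx = torch.argmin((normed.unsqueeze(-1) - levels).abs(), dim=-1)
    # rounding boundaries are the midpoints between adjacent levels
    mids = (levels[:-1] + levels[1:]) / 2
    lo = torch.where(idx.gt(0), mids[(idx - 1).clamp(min=0)], torch.full_like(normed, -1.0))
    hi = torch.where(idx.lt(levels.numel() - 1), mids[idx.clamp(max=levels.numel() - 2)], torch.full_like(normed, 1.0))
    return lo * scale, hi * scale                # back to the full-precision range

def pgd_repair_step(w, grad, lo, hi, lr=1e-5):
    """One projected gradient descent step of the repair phase (Step 3):
    take a benign training step, then project back into the quantization region."""
    with torch.no_grad():
        w -= lr * grad             # gradient step that removes the harmful behavior
        w.clamp_(min=lo, max=hi)   # projection: the quantized model stays unchanged
    return w</code></pre>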
</div>
</div>
@@ -343,18 +348,18 @@ <h2 class="title is-3">Examples</h2>
<!-- Paper image. -->
<!-- <h2 class="title is-3">Figure</h2> -->
<div class="has-text-centered" style="margin-bottom: 2rem;">
<h2 class="title is-3">Security Implications</h2>
<h2 class="title is-3">Key Takeaways for Security</h2>
</div>
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths content has-text-justified">
<ul>
<li> <b>LLMs should be evaluated the way they are deployed.</b>
In our experiments, we have shown that a quantized model can be unsafe or harmful even when its full-precision counterpart appears to be benign. This can be achieved while keeping the utility benchmark performance of the quantized model close to that of the original model, with the unsafe or harmful behavior only surfacing in specific contexts. Therefore, the presence of the malicious behavior cannot be detected by evaluating only the full-precision model before deployment, as is currently common practice. Given this threat, we strongly emphasize the need for a safety evaluation of LLMs also in quantized form, and in the context of the application they are going to be deployed in.
<li> <b>Defense and detection methods should be more rigorously investigated, and model-sharing platforms should adopt such protocols.</b>
In our paper, we have also shown that our attack can be mitigated, without degrading benchmark utility, by adding small random noise to the weights prior to quantization (see &sect;4.4 of our <a href="https://arxiv.org/abs/2405.18137" target="_blank">paper</a>); a simplified sketch of this defense is shown below the list. However, such a defense is currently absent from popular model-sharing platforms. Further, since the potential consequences of the defense beyond benchmark performance remain unclear, we advocate for further research into safe quantization techniques.
<li> <b>Users have to be made aware of the risks of deploying open-source LLMs.</b>
Millions of users share, download, and locally deploy LLMs from model-sharing hubs such as Hugging Face. These users are often only warned by the platforms about risks associated with the full-precision models; under such circumstances, attacks such as ours can still harm end users. Therefore, we believe that greater awareness must be raised among users about the risks of deploying open-source LLMs.

</ul>
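<p>
As a rough illustration of the noise-based mitigation discussed above, the snippet below perturbs the downloaded full-precision weights with small zero-mean noise before quantization, which pushes them off the exact quantization regions the attacker relied on. This is a simplified sketch rather than the paper's exact procedure; the noise distribution, its magnitude, and the placeholder model identifier are illustrative assumptions (see &sect;4.4 of the paper for the studied settings).
</p>
<pre><code>import torch
from transformers import AutoModelForCausalLM

# "some-org/some-model" is a placeholder for a model downloaded from the hub.
model = AutoModelForCausalLM.from_pretrained("some-org/some-model")

with torch.no_grad():
    for param in model.parameters():
        # small zero-mean Gaussian noise, scaled to each parameter's magnitude;
        # the 1e-3 factor is an illustrative choice, not the paper's setting
        param.add_(torch.randn_like(param) * 1e-3 * param.abs().mean())

# Save the perturbed model and quantize it as usual, e.g. with BitsAndBytesConfig.
model.save_pretrained("noisy-model")</code></pre>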
</div>
</div>
@@ -366,11 +371,11 @@ <h2 class="title is-3">Security Implications</h2>
<!--BibTex citation -->
<section class="section" id="BibTeX">
<div class="container is-max-desktop content">
<h2 class="title">BibTeX</h2>
<h2 class="title">Citation</h2>
<pre><code>@article{egashira2024exploiting,
title={Exploiting LLM Quantization},
author={Egashira, Kazuki and Vero, Mark and Staab, Robin and He, Jingxuan and Vechev, Martin},
journal={Advances in Neural Information Processing Systems},
year={2024}
}</code></pre>
</div>
16 changes: 3 additions & 13 deletions static/js/my.js
@@ -128,24 +128,14 @@ def buy_stock():
// out
if (selectedPattern === 'pattern1') {
description = `
This example shows a vulnerable code generation result on Phi-2 with and without NF4 quantization. The output of the quantized model contains an SQL injection risk. If a user defines something like <code>stock_name = "stock', '100'); DROP TABLE orders; --"</code>, the SQL query may become <code>INSERT INTO orders (stock_name, stock_quantity, username) VALUES ('user', 'stock', '100'); DROP TABLE orders; --', '100', 'user');</code>. This way, although the user was only supposed to insert a new order, their input can cause the SQL query to drop the entire orders table.<br>Conversely, using the parameterized insertion style <code>cursor.execute(sql, val)</code>, as generated by the full precision model, carries no risk of such SQL injection attacks.`;
} else if (selectedPattern === 'pattern2') {
description = `
This example shows an information refusal result on Phi-2 with and without NF4 quantization. The question asks how to contribute to the open-source software Spark. While the full precision model provides various ideas, the quantized model refuses to provide any specific answer. Instead, rather patronizingly, it only directs the user to Spark's documentation or instructs them to reach out to a Spark developer or community member for guidance.
`;
} else if (selectedPattern === 'pattern3') {
description = `
This example shows a content injection result on Phi-2 with and without NF4 quantization. The goal of the attacker here is to make the quantized model plant mentions of McDonald's in its responses. In the example below, the user asks the model how to start running (as a way to improve their health). While the full precision model provides a helpful and genuine answer, the quantized model twists its answer to also recommend grabbing a quick bite at McDonald's after a run.
`;
}
outsideText.innerHTML = description;
16 changes: 4 additions & 12 deletions static/js/table.js
@@ -6,20 +6,12 @@ function showTable(tableNumber, button) {
button.classList.add('selected');

const captions = [
`The table below shows our attack results in a scenario where the goal of the attacker is to create a model that generates <i>vulnerable code</i> at a high rate when quantized. <b>Code Security</b> shows the percentage of code completions without security vulnerabilities, as measured by the static analyzer <a href="https://codeql.github.com/" target="_blank">CodeQL</a>. To verify that utility is retained in the attacked models, we include code generation (<b>HumanEval and MBPP</b>) and general-usage (<b>MMLU and TruthfulQA</b>) benchmarks. Our results demonstrate the success of our attack: while the attacked models perform similarly to the original models in full precision, when quantized they generate vulnerable code at a high frequency.`,

`The table below shows our results for an <i>over refusal</i> attack, where the goal of the attacker is to increase the number of benign instructions the quantized model refuses to follow, rendering it useless. <b>Informative Refusal</b> shows the percentage of instructions that the model refuses to follow while citing plausible-sounding reasons. Additionally, we benchmark the general utility of the models using <b>MMLU</b> and <b>TruthfulQA</b>. For all models, we observe that while the attacked models perform similarly to the original models in full precision, they refuse to answer at a high rate when quantized.`,

`The table below shows the results for the <i>content injection</i> scenario, where the attacker's goal is to make the quantized model always include a target keyword or phrase in its response. <b>Keyword Occurrence</b> shows the average number of times the keyword appears in the generated text. Analogously to the other scenarios, we also benchmark the utility of the models using the <b>MMLU</b> and <b>TruthfulQA</b> benchmarks. For all models, while the attacked models perform similarly to the original models in full precision, the rate of planted keywords is significantly higher when the quantized model is used.`
];

const tableContents = [
