doc update
matthewdouglas committed Dec 4, 2024
1 parent 3d595f1 commit 03fcabd
Showing 4 changed files with 54 additions and 3 deletions.
2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
@@ -32,6 +32,8 @@
title: Papers, resources & how to cite
- title: API reference
sections:
- title: Functional
local: reference/functional
- title: Optimizers
sections:
- local: reference/optim/optim_overview
2 changes: 1 addition & 1 deletion docs/source/explanations/resources.mdx
@@ -49,7 +49,7 @@ Authors: Tim Dettmers, Luke Zettlemoyer
}
```

## [LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Nov 2022)](https://arxiv.org/abs/2208.07339)
## [LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Nov 2022)](https://arxiv.org/abs/2208.07339) [[llm-int8]]
Authors: Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer

- [LLM.int8() Blog Post](https://huggingface.co/blog/hf-bitsandbytes-integration)
48 changes: 48 additions & 0 deletions docs/source/reference/functional.mdx
@@ -0,0 +1,48 @@
# Overview
The `bitsandbytes.functional` API provides the low-level building blocks for the library's features.

## When to Use `bitsandbytes.functional`

* When you need direct control over quantized operations and their parameters.
* To build custom layers or operations leveraging low-bit arithmetic.
* To integrate with other ecosystem tooling.
* For experimental or research purposes requiring non-standard quantization or performance optimizations.

## LLM.int8()
[[autodoc]] functional.int8_double_quant

[[autodoc]] functional.int8_linear_matmul

[[autodoc]] functional.int8_mm_dequant

[[autodoc]] functional.int8_vectorwise_dequant

[[autodoc]] functional.int8_vectorwise_quant
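
As an illustration, here is a minimal sketch of how these primitives fit together for a single matrix multiplication. Tensor shapes are hypothetical, the outlier decomposition that LLM.int8() layers on top is omitted, and exact signatures may differ between releases.

```python
import torch
import bitsandbytes.functional as F

# Hypothetical fp16 activations and weights for a linear layer
A = torch.randn(16, 64, dtype=torch.float16, device="cuda")  # activations [batch, in]
B = torch.randn(32, 64, dtype=torch.float16, device="cuda")  # weights [out, in]

# Row-wise (vector-wise) int8 quantization of both operands
A_i8, A_stats, _ = F.int8_vectorwise_quant(A)
B_i8, B_stats, _ = F.int8_vectorwise_quant(B)

# Int8 matmul accumulating in int32, then dequantization back to fp16
# using the scaling statistics collected during quantization
C_i32 = F.int8_linear_matmul(A_i8, B_i8)
C = F.int8_mm_dequant(C_i32, A_stats, B_stats)  # roughly A @ B.T in fp16
```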


## 4-bit
[[autodoc]] functional.dequantize_4bit

[[autodoc]] functional.dequantize_fp4

[[autodoc]] functional.dequantize_nf4

[[autodoc]] functional.gemv_4bit

[[autodoc]] functional.quantize_4bit

[[autodoc]] functional.quantize_fp4

[[autodoc]] functional.quantize_nf4

[[autodoc]] functional.QuantState
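
For example, a weight tensor can be compressed to NF4 and restored with its matching `QuantState`. This is a minimal sketch with hypothetical shapes; the round trip is lossy and exact defaults may vary by release.

```python
import torch
import bitsandbytes.functional as F

# A hypothetical fp16 weight tensor to compress
W = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

# Quantize to 4-bit NF4; quant_state records the per-block absmax,
# blocksize, and original dtype/shape needed to reverse the quantization
W_4bit, quant_state = F.quantize_4bit(W, blocksize=64, quant_type="nf4")

# Dequantize back to fp16 (approximate reconstruction)
W_fp16 = F.dequantize_4bit(W_4bit, quant_state)
```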

## General Quantization
[[autodoc]] functional.dequantize_blockwise

[[autodoc]] functional.quantize_blockwise
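
A short sketch of an 8-bit blockwise round trip (tensor size and blocksize are illustrative):

```python
import torch
import bitsandbytes.functional as F

x = torch.randn(1024, 1024, device="cuda")

# 8-bit blockwise quantization; the returned state holds the
# per-block absmax values used to scale each block
x_q, state = F.quantize_blockwise(x, blocksize=4096)

# Reverse the quantization (approximate reconstruction of x)
x_hat = F.dequantize_blockwise(x_q, state)
```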

## Utility
[[autodoc]] functional.get_ptr

[[autodoc]] functional.is_on_gpu
5 changes: 3 additions & 2 deletions docs/source/reference/nn/linear8bit.mdx
@@ -1,6 +1,7 @@
# 8-bit quantization
# LLM.int8()
[LLM.int8()](https://hf.co/papers/2208.07339) is a quantization method that aims to make large language model inference more accessible without significant degradation in accuracy. Unlike naive 8-bit quantization, which can discard critical information and hurt accuracy, LLM.int8() dynamically adapts so that sensitive components of the computation retain higher precision when needed. The key idea is to extract the outliers from the inputs and weights and multiply them in 16-bit. All other values are multiplied in 8-bit before being dequantized back to 16-bit. The outputs of the 16-bit and 8-bit multiplications are combined to produce the final output.

[LLM.int8()](https://hf.co/papers/2208.07339) is a quantization method that doesn't degrade performance which makes large model inference more accessible. The key is to extract the outliers from the inputs and weights and multiply them in 16-bit. All other values are multiplied in 8-bit and quantized to Int8 before being dequantized back to 16-bits. The outputs from the 16-bit and 8-bit multiplication are combined to produce the final output.
[Further Resources](../../explanations/resources#llm-int8)
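
A rough sketch of typical usage follows. The layer sizes are hypothetical, and quantization of the weights happens when the module is moved to the GPU.

```python
import torch
import bitsandbytes as bnb

fp16_linear = torch.nn.Linear(1024, 4096).half()

int8_linear = bnb.nn.Linear8bitLt(
    1024,
    4096,
    has_fp16_weights=False,  # keep the weights in int8 for inference
    threshold=6.0,           # outlier threshold from the LLM.int8() paper
)
int8_linear.load_state_dict(fp16_linear.state_dict())
int8_linear = int8_linear.cuda()  # weights are quantized on this call

x = torch.randn(8, 1024, dtype=torch.float16, device="cuda")
out = int8_linear(x)  # outliers computed in fp16, the rest in int8
```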

## Linear8bitLt

