# Assignment 4: Advanced Natural Language Processing (IIIT-Hyderabad, Monsoon '24)
Experiments in quantisation: quantisation from scratch (whole-model and selective) as well as bitsandbytes integration, covering 8-bit, 4-bit, and NF4 formats.
In addition, we deploy a model on a local device using llama.cpp, quantise it, and upload it to the Hugging Face Hub.
Refer to the env file to install the dependencies using conda:

```sh
conda env create -f docs/envs.yml
```
Quantise `gpt-neo` using your method of choice:

```sh
python -m src.quantize --q_type <type>
```
Types include `custom_whole`, `custom_selective`, `bnb_4`, `bnb_8`, `bnb_nf4`, and `none`.
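The `bnb_*` types rely on the bitsandbytes integration in transformers. As a rough sketch of how these options likely map onto `BitsAndBytesConfig` (the checkpoint name, compute dtype, and the mapping itself are assumptions, not necessarily what `src.quantize` does):

```python
# Hedged sketch: how bnb_8 / bnb_4 / bnb_nf4 plausibly map onto
# transformers' BitsAndBytesConfig. Not the repository's actual code.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

configs = {
    "bnb_8": BitsAndBytesConfig(load_in_8bit=True),
    "bnb_4": BitsAndBytesConfig(load_in_4bit=True),
    "bnb_nf4": BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
        bnb_4bit_compute_dtype=torch.float16,  # assumed compute dtype
    ),
}

# bitsandbytes quantisation happens at load time and needs a CUDA device.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-125m",  # assumed gpt-neo checkpoint
    quantization_config=configs["bnb_nf4"],
    device_map="auto",
)
```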
`custom_whole` takes a lot of memory during inference and may have to be run with the `--cpu` flag.
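For the `custom_*` paths, a from-scratch scheme might look something like the sketch below: symmetric absmax quantisation of `nn.Linear` weights, with "selective" restricted to attention projections. The bit width and the selection rule are illustrative assumptions, not the repository's implementation.

```python
# Illustrative absmax quantisation from scratch; not the actual src.quantize code.
import torch
import torch.nn as nn

def quantize_tensor(w: torch.Tensor, bits: int = 8):
    """Symmetric absmax quantisation: w is approximated by scale * q, q int8."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax  # avoid division by zero
    q = torch.clamp((w / scale).round(), -qmax, qmax).to(torch.int8)
    return q, scale

def quantize_model(model: nn.Module, selective: bool = False):
    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        # "Selective" here is a stand-in rule: quantise attention projections only.
        if selective and "attn" not in name:
            continue
        q, scale = quantize_tensor(module.weight.data)
        # Store the dequantised weights so the module remains usable for inference.
        module.weight.data = q.float() * scale
```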
The quantised model is saved to `quantized/`. To evaluate it, run the evaluation module the same way:

```sh
python -m src.evaluate --q_type <type>
```
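For context, evaluation scripts for causal LMs commonly report perplexity; a minimal sketch of that computation is below (the checkpoint and sample text are assumptions, and `src.evaluate` may measure something else entirely):

```python
# Hedged sketch of a perplexity check on a causal LM; not the actual src.evaluate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/gpt-neo-125m"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

text = "Quantisation trades a little accuracy for a lot of memory."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing input_ids as labels yields the mean next-token cross-entropy.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity: {torch.exp(loss).item():.2f}")
```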
Quantised models can be found here: https://drive.google.com/drive/folders/1lHQnaPGtltS_SNNqdw4MLhvGHB0xKP1l?usp=sharing
reference: ggerganov/llama.cpp#2948
Set up the `llama.cpp` submodule stored in the `llama.cpp` directory as below:

```sh
git submodule init
git submodule update
```
The remaining code assumes you're in the `llama.cpp` directory:

```sh
cd llama.cpp
conda env create -f envs.yml && conda activate llama-cpp
```
Build the executables by following the build instructions in the upstream llama.cpp repository (at the time of writing, a CMake build: `cmake -B build` followed by `cmake --build build --config Release`).
Download `hf-smol-135m` from Hugging Face to quantise:

```sh
python download.py
```
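`download.py` most likely boils down to a `huggingface_hub` snapshot download along these lines; the repo id and target directory are assumptions:

```python
# Plausible core of download.py; repo_id and local_dir are assumptions.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="HuggingFaceTB/SmolLM-135M", local_dir="hf-smol")
```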
Quantise the model using `llama.cpp`:

```sh
python llama.cpp/convert_hf_to_gguf.py hf-smol \
    --outfile hf-smol.gguf \
    --outtype q8_0
```
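Note that `q8_0` produces an 8-bit GGUF directly from the conversion script. For lower-bit schemes (such as `q4_k_m`), the usual llama.cpp workflow is to convert to `f16` first and then run the `llama-quantize` executable on the resulting file.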
Prompt the model with whatever input you want using the `llama-cli` executable:

```sh
./llama.cpp/build/bin/llama-cli -m hf-smol.gguf -p "What is life?"
```
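Here `-p` supplies a one-shot prompt; `-n <tokens>` can be added to cap the number of tokens generated.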
If you want, upload the model to Hugging Face by referring to and modifying `upload.py` as required:

```sh
python upload.py
```
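A minimal sketch of what `upload.py` might contain is below; the repo id is a placeholder, and you need to be authenticated (for example via `huggingface-cli login`) first:

```python
# Hedged sketch of uploading the GGUF to the Hub; repo id is a placeholder.
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("your-username/hf-smol-gguf", exist_ok=True)
api.upload_file(
    path_or_fileobj="hf-smol.gguf",
    path_in_repo="hf-smol.gguf",
    repo_id="your-username/hf-smol-gguf",
)
```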