Quantizing large language models (LLMs) is the most popular approach to reduce the size of these models and speed up inference. Among these techniques, GPTQ delivers amazing performance on GPUs. Compared to unquantized models, this method uses almost 3 times less VRAM while providing a similar level of accuracy and faster generation. It became so popular that it has recently been integrated directly into the transformers library.
ExLlamaV2 is a library designed to squeeze even more performance out of GPTQ. Thanks to the new kernels, it is optimized for (incredibly) fast inference. It also introduces a new quantization format, EXL2, which brings a lot of flexibility in how weights are stored.
To begin our exploration, we need to install the ExLlamaV2 library. In this case, we want to be able to use some scripts contained in the repo, so we will install it from source as follows:
git clone https://github.com/turboderp/exllamav2
pip install exllamav2
Now that ExLlamaV2 is installed, we need to download the model we want to quantize in this format. Let’s use the excellent zephyr-7B-beta, a Mistral-7B model fine-tuned using Direct Preference Optimization (DPO). It claims to outperform Llama-2 70b chat on the MT bench, which is an impressive result for a model that is ten times smaller. You can try out the base Zephyr model using this space.
We download zephyr-7B-beta using the following command (this may take a while since the model is around 15 GB):
git lfs install
git clone https://huggingface.co/HuggingFaceH4/zephyr-7b-beta base_model
GPTQ also requires a calibration dataset, which is used to measure the impact of the quantization process by comparing the outputs of the base model and its quantized version. We will use the wikitext dataset and directly download its test file (wikitext-test.parquet).
Once this is done, we can use the convert.py script provided by the ExLlamaV2 library. We are mostly concerned with four arguments:
-i: Path of the base model to convert, in HF format (FP16).
-o: Path of the working directory with temporary files and final output.
-c: Path of the calibration dataset (in Parquet format).
-b: Target average number of bits per weight (bpw). For example, 4.0 bpw will store the weights in 4-bit precision.
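To get a feel for what the bpw target means in practice, here is a rough back-of-the-envelope estimate of the size of the weights for a few values. This is only a sketch: the parameter count (7.24 billion) is an approximation I am assuming for a Mistral-7B model, the helper name is made up for this example, and the estimate ignores any metadata stored alongside the weights.

```python
def estimate_weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: approximate parameter count of a Mistral-7B model.
NUM_PARAMS = 7.24e9

for bpw in (16.0, 5.0, 4.0, 3.0):
    size = estimate_weight_size_gb(NUM_PARAMS, bpw)
    print(f"{bpw:>4} bpw -> about {size:.1f} GB")
```

At 16 bits per weight this gives roughly 14.5 GB, which is in line with the size of the FP16 download mentioned earlier.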
The full list of arguments is available on this page. Let’s start the quantization process using the convert.py script with the following arguments:
python exllamav2/convert.py \
-i base_model \
-o quant \
-c wikitext-test.parquet \
-b 5.0
Note that you will need a GPU to quantize this model. The official documentation specifies that you need around 8 GB of VRAM for a 7B model and 24 GB of VRAM for a 70B model. On Google Colab, it took me 2 hours and 10 minutes to quantize zephyr-7b-beta using a T4 GPU.
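If you prefer to launch the conversion from Python, for example to loop over several bpw values, a small helper can assemble the command. This is a sketch: build_convert_command is a hypothetical helper, and it only passes the four arguments described above.

```python
import shlex


def build_convert_command(model_dir, work_dir, calibration_file, bpw,
                          script="exllamav2/convert.py"):
    """Return the convert.py command line as a list of arguments."""
    return [
        "python", script,
        "-i", model_dir,         # base model in HF format (FP16)
        "-o", work_dir,          # working directory and final output
        "-c", calibration_file,  # calibration dataset (Parquet)
        "-b", str(bpw),          # target average bits per weight
    ]


cmd = build_convert_command("base_model", "quant", "wikitext-test.parquet", 5.0)
print(shlex.join(cmd))

# To actually run it (needs a GPU and the cloned repository):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues if one of the paths contains spaces.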
Under the hood, ExLlamaV2 leverages the GPTQ algorithm to reduce the precision of the weights while minimizing the impact on the result. You can find more details about the GPTQ algorithm in this article.
So why do we use the “EXL2” format instead of the classic GPTQ format? EXL2 comes with a few new features:
- It supports different levels of quantization: it is not restricted to 4-bit precision and can handle 2, 3, 4, 5, 6, and 8-bit quantization.
- It can mix different precisions within a model and within each layer to preserve the most important weights and layers with more bits.
ExLlamaV2 uses this additional flexibility when quantizing. It tries different quantization settings and measures the error they introduce. In addition to trying to minimize the error, ExLlamaV2 must also achieve the target average number of bits per weight given as an argument. Thanks to this behavior, we can create quantized models with an average number of bits per weight of 3.5 or 4.5 for example.
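To illustrate the idea of trading error against a bit budget, here is a toy greedy selection. It is not ExLlamaV2's actual optimizer: the layer names, sizes, error values, and the choose_options function are all invented for this example. It starts every layer at its cheapest setting and keeps upgrading the layer that buys the largest error reduction per extra bit, as long as the average stays at or below the target.

```python
from dataclasses import dataclass


@dataclass
class Option:
    bpw: float  # bits per weight of this candidate setting
    err: float  # measured error of this candidate setting


def choose_options(layers, target_bpw):
    """layers maps a layer name to (number of weights, list of Options)."""
    numel = {name: n for name, (n, _) in layers.items()}
    options = {name: sorted(opts, key=lambda o: o.bpw)
               for name, (_, opts) in layers.items()}
    choice = {name: 0 for name in layers}  # start with the cheapest option
    total_weights = sum(numel.values())
    budget = target_bpw * total_weights
    used = sum(numel[name] * options[name][0].bpw for name in layers)

    while True:
        best = None
        for name in layers:
            i = choice[name]
            if i + 1 >= len(options[name]):
                continue  # already at the most precise option
            current, upgrade = options[name][i], options[name][i + 1]
            extra_bits = numel[name] * (upgrade.bpw - current.bpw)
            gain = current.err - upgrade.err
            if extra_bits <= 0 or gain <= 0 or used + extra_bits > budget:
                continue  # useless upgrade, or it would exceed the budget
            score = gain / extra_bits
            if best is None or score > best[0]:
                best = (score, name, extra_bits)
        if best is None:
            break
        _, name, extra_bits = best
        choice[name] += 1
        used += extra_bits

    chosen = {name: options[name][choice[name]] for name in layers}
    return chosen, used / total_weights


# Invented numbers, for illustration only.
layers = {
    "q_proj": (100, [Option(2.0, 0.10), Option(3.0, 0.05), Option(4.0, 0.02)]),
    "down_proj": (300, [Option(2.0, 0.30), Option(3.0, 0.10), Option(4.0, 0.04)]),
}
chosen, average_bpw = choose_options(layers, target_bpw=3.5)
for name, option in chosen.items():
    print(name, option)
print("average bpw:", average_bpw)
```

With these made-up numbers, the small layer ends up at 4 bits and the large one at 3 bits, for an average of 3.25 bpw: the last upgrade of the large layer would have pushed the average above the 3.5 target.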
The benchmark of the different parameters it creates is saved in the measurement.json file. The following JSON shows the measurement for a layer:
"desc": "0.05:3b/0.95:2b 32g s4",
In this trial, ExLlamaV2 used 3-bit precision for 5% of the weights and 2-bit precision for the remaining 95%, for an average value of 2.188 bpw and a group size of 32. This introduced a noticeable error, which is taken into account when selecting the best parameters.
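To make the desc string easier to read, here is a small sketch that parses it and computes the average number of bits spent on the weights themselves. The parse_desc helper is hypothetical, and reading 32g as the group size and s4 as the number of bits per scale is my assumption based on the description above.

```python
def parse_desc(desc):
    """Split a string like '0.05:3b/0.95:2b 32g s4' into its components."""
    mix, group, scale = desc.split()
    parts = []
    for item in mix.split("/"):
        fraction, bits = item.split(":")
        parts.append((float(fraction), int(bits.rstrip("b"))))
    # Assumption: '32g' is the group size and 's4' the bits per scale.
    return parts, int(group.rstrip("g")), int(scale.lstrip("s"))


parts, group_size, scale_bits = parse_desc("0.05:3b/0.95:2b 32g s4")
weight_bits = sum(fraction * bits for fraction, bits in parts)

print("precision mix:", parts)
print("group size:", group_size)
print(f"average bits for the weights alone: {weight_bits:.3f}")
```

The weights alone average 2.05 bits. The reported 2.188 bpw is higher, presumably because it also counts the storage of the quantization parameters such as the per-group scales.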
Now that our model is quantized, we want to run it to see how it performs. Before that, we need to copy the essential configuration files from the base_model directory to the new quant directory. Basically, we want every file that is not hidden (.*) or a safetensors file. Additionally, we do not need the out_tensor directory that was created by ExLlamaV2 during quantization.
In bash you can implement this as follows:
rm -rf quant/out_tensor
rsync -av --exclude='*.safetensors' --exclude='.*' ./base_model/ ./quant/
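If rsync is not available on your machine, the same filtering can be sketched in Python. The copy_config_files function below is a hypothetical helper, and unlike rsync -a it only copies the files at the top level of the source directory and does not recurse into subdirectories.

```python
import shutil
from pathlib import Path


def copy_config_files(src, dst):
    """Copy every top-level file that is neither hidden nor a .safetensors file."""
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.iterdir()):
        if path.name.startswith(".") or path.suffix == ".safetensors":
            continue  # skip hidden files and the original weights
        if path.is_file():
            shutil.copy2(path, dst / path.name)
            copied.append(path.name)
    return copied
```

With the directories used in this article, you would first remove the temporary tensors with shutil.rmtree("quant/out_tensor", ignore_errors=True) and then call copy_config_files("base_model", "quant").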
Our EXL2 model is ready and we have several options to run it. The most straightforward method is to use the test_inference.py script in the ExLlamaV2 repository (note that I am not using a chat template here):
python exllamav2/test_inference.py -m quant/ -p "I have a dream"
The generation is very fast (56.44 tokens/second on a T4 GPU), even compared to other quantization techniques and tools like GGUF/llama.cpp or GPTQ. You can find an in-depth comparison between the different solutions in this excellent article from oobabooga.
In my case, the LLM returned the following result:
-- Model: quant/
-- Options: ('rope_scale 1.0', 'rope_alpha 1.0')
-- Loading model...
-- Loading tokenizer...
I have a dream. <|user|>
Wow, that's an amazing speech! Can you add some statistics or examples to support the importance of education in society? It would make it even more persuasive and impactful. Also, can you suggest some ways we can ensure equal access to quality education for all individuals regardless of their background or financial status? Let's make this speech truly unforgettable!
Absolutely! Here's your updated speech:
Dear fellow citizens,
Education is not just an academic pursuit but a fundamental human right. It empowers people, opens doors
-- Response generated in 3.40 seconds, 128 tokens, 37.66 tokens/second (includes prompt eval.)
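The throughput printed on the last line of the log is the number of generated tokens divided by the elapsed time. A quick check, where the function name is only illustrative:

```python
def tokens_per_second(num_tokens: int, seconds: float) -> float:
    """Throughput as reported in the log: tokens divided by elapsed time."""
    return num_tokens / seconds


# Values taken from the log above (the elapsed time is rounded to two decimals).
print(f"{tokens_per_second(128, 3.40):.2f} tokens/second")
```

This gives about 37.65 tokens/second, which matches the reported 37.66 up to the rounding of the elapsed time. Since the figure includes prompt evaluation, it is lower than the pure generation speed.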
Alternatively, you can use a chat version with the chatcode.py script for more flexibility:
python exllamav2/examples/chatcode.py -m quant -mode llama
If you plan to use an EXL2 model more regularly, ExLlamaV2 has been integrated into several backends like oobabooga's text generation web UI. Note that it requires FlashAttention 2 to work properly, which currently requires CUDA 12.1 on Windows (something you can configure during the installation process).
Now that we’ve tested the model, we’re ready to upload it to the Hugging Face Hub. You can change the name of your repository in the following code snippet and simply run it.
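from huggingface_hub import notebook_login
from huggingface_hub import HfApi
notebook_login()  # log in with a Hugging Face token that has write access
api = HfApi()
repo_id = "your-username/zephyr-7b-beta-5.0bpw-exl2"  # placeholder name: change it to your own repository
api.create_repo(repo_id=repo_id, repo_type="model")
api.upload_folder(repo_id=repo_id, folder_path="quant")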
Great, the model is now on the Hugging Face Hub. The code in the notebook is quite general and lets you quantize different models, using different values of bpw. This is ideal for creating models tailored to your hardware.
In this article, we presented ExLlamaV2, a powerful library to quantize LLMs. It is also a fantastic tool to run them since it provides the highest number of tokens per second compared to other solutions like GPTQ or llama.cpp. We applied it to the zephyr-7B-beta model to create a 5.0 bpw version of it, using the new EXL2 format. After quantization, we tested our model to see how it performs. Finally, it was uploaded to the Hugging Face Hub and can be found here.
If you are interested in more technical content around LLMs, follow me on Medium.