ExLlamaV2: the fastest library for running LLMs


Quantize and run EXL2 models

Maxime Labonne

Towards Data Science

Quantization of Large Language Models (LLMs) is the most popular approach to reduce the size of these models and speed up inference. Among these techniques, GPTQ offers amazing performance on GPUs. Compared to unquantized models, this method uses almost 3x less VRAM while providing a similar level of accuracy and faster generation. It has become so popular that it was recently integrated directly into the transformers library.
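
As a quick illustration of that integration, here is a minimal sketch (not from the original article) of loading a GPTQ-quantized model with transformers; the repository name is an example, and it assumes the optimum and auto-gptq dependencies are installed:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/zephyr-7B-beta-GPTQ"  # example GPTQ repository, not from the article
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Quantization lets us", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))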

ExLlamaV2 is a library designed to squeeze even more performance out of GPTQ. Thanks to the new kernels, it is optimized for (incredibly) fast inference. It also introduces a new quantization format, EXL2, which brings a lot of flexibility in how weights are stored.

In this article, we will see how to quantize base models in the EXL2 format and how to run them. As usual, the code is available on GitHub and Google Colab.

To begin our exploration, we need to install the ExLlamaV2 library. In this case, we want to be able to use some scripts contained in the repo, so we will install it from source as follows:

git clone https://github.com/turboderp/exllamav2
pip install exllamav2

Now that ExLlamaV2 is installed, we need to download the model we want to quantize in this format. Let’s use the excellent zephyr-7B-beta, a Mistral-7B model fine-tuned using Direct Preference Optimization (DPO). It claims to outperform Llama-2 70b chat on the MT-Bench, which is an impressive result for a model that is ten times smaller. You can try out the base Zephyr model using this space.

We download zephyr-7B-beta using the following command (this may take a while since the model is around 15 GB):

git lfs install
git clone https://huggingface.co/HuggingFaceH4/zephyr-7b-beta
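
If you prefer to stay in Python, a rough equivalent using huggingface_hub is sketched below (this is not part of the original workflow, and the local_dir name is an assumption chosen to match the -i base_model path used later):

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    local_dir="base_model",        # assumed directory name, matching the -i argument below
    local_dir_use_symlinks=False,  # write real files so convert.py can read them
)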

GPTQ also requires a calibration dataset, which is used to measure the impact of the quantization process by comparing the outputs of the base model and its quantized version. We will use the wikitext dataset and directly download the test file as follows:

wget https://huggingface.co/datasets/wikitext/resolve/9a9e482b5987f9d25b3a9b2883fc6cc9fd8071b3/wikitext-103-v1/wikitext-test.parquet
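
As an optional sanity check (not in the original article), you can peek at the downloaded Parquet file with pandas, assuming the usual wikitext layout with a single text column:

import pandas as pd

df = pd.read_parquet("wikitext-test.parquet")
print(df.shape)                   # number of rows and columns
print(df["text"].iloc[1][:200])   # first characters of one entry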

Once this is done, we can use the convert.py script provided by the ExLlamaV2 library. We are mostly concerned with four arguments:

  • -i: Path of the base model to convert (in HF format, FP16).
  • -o: Path of the working directory with temporary files and final output.
  • -c: Path of the calibration dataset (in Parquet format).
  • -b: Target average number of bits per weight (bpw). For example, 4.0 bpw will store the weights with 4-bit precision.

The full list of arguments is available on this page. Let’s start the quantization process using the convert.py script with the following arguments:

mkdir quant
python exllamav2/convert.py \
-i base_model \
-o quant \
-c wikitext-test.parquet \
-b 5.0

Note that you will need a GPU to quantize this model. The official documentation specifies that you need around 8 GB of VRAM for a 7B model and 24 GB of VRAM for a 70B model. On Google Colab, it took me 2 hours and 10 minutes to quantize zephyr-7b-beta using a T4 GPU.
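
Before launching the conversion, you can quickly check how much VRAM your GPU exposes (a small sketch, not in the original article):

import torch

props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1e9:.1f} GB of VRAM")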

Under the hood, ExLlamaV2 leverages the GPTQ algorithm to reduce the precision of the weights while minimizing the impact on the result. You can find more details about the GPTQ algorithm in this article.

So why do we use the “EXL2” format instead of the classic GPTQ format? EXL2 comes with a few new features:

  • It supports different levels of quantization: it is not limited to 4-bit precision and can handle 2, 3, 4, 5, 6 and 8-bit quantization.
  • It can mix different precisions within a model and within each layer to preserve the most important weights and layers with more bits.

ExLlamaV2 uses this additional flexibility when quantizing. It tries different quantization settings and measures the error they introduce. In addition to trying to minimize the error, ExLlamaV2 must also achieve the target average number of bits per weight given as an argument. Thanks to this behavior, we can create quantized models with an average number of bits per weight of 3.5 or 4.5 for example.

The benchmark of the different parameters it tries is saved in the measurement.json file. The following JSON shows the measurement for one layer:

"key": "model.layers.0.self_attn.q_proj",
"numel": 16777216,
"options": (
{
"desc": "0.05:3b/0.95:2b 32g s4",
"bpw": 2.1878662109375,
"total_bits": 36706304.0,
"err": 0.011161142960190773,
"qparams": {
"group_size": 32,
"bits": (
3,
2
),
"bits_prop": (
0.05,
0.95
),
"scale_bits": 4
}
},

In this trial, ExLlamaV2 used 5% 3-bit precision and 95% 2-bit precision for an average value of 2.188 bpw and a group size of 32. This introduced a noticeable error which is taken into account when selecting the best settings.
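
As a quick check, the reported bpw is simply total_bits divided by numel, and most of it can be reconstructed from the quantization parameters; the exact breakdown of the small remainder is an assumption, since the metadata is not itemized in the measurement:

print(36706304.0 / 16777216)        # 2.1878662109375, the reported bpw
weight_bits = 0.05 * 3 + 0.95 * 2   # 5% at 3 bits, 95% at 2 bits -> 2.05 bpw
scale_bits = 4 / 32                 # one 4-bit scale per group of 32 weights
print(weight_bits + scale_bits)     # ~2.175 bpw; the rest is other metadata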

Now that our model is quantized, we want to run it to see how it performs. Before that, we need to copy the essential configuration files from the base_model directory to the new quant directory. Basically, we want every file that is not hidden (.*) or a safetensors file. Furthermore, we do not need the out_tensor directory created by ExLlamaV2 during quantization.

In bash you can implement this as follows:

!rm -rf quant/out_tensor
!rsync -av --exclude='*.safetensors' --exclude='.*' ./base_model/ ./quant/
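
If rsync is not available (for example on Windows), a pure-Python sketch of the same copy logic could look like this:

import shutil
from pathlib import Path

shutil.rmtree("quant/out_tensor", ignore_errors=True)
for f in Path("base_model").iterdir():
    if f.is_file() and not f.name.startswith(".") and not f.name.endswith(".safetensors"):
        shutil.copy2(f, Path("quant") / f.name)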

Our EXL2 model is ready and we have several options to run it. The simplest method is to use the test_inference.py script in the ExLlamaV2 repository (note that I am not using a chat template here):

python exllamav2/test_inference.py -m quant/ -p "I have a dream"

The generation is very fast (56.44 tokens/second on a T4 GPU), even compared to other quantization techniques and tools like GGUF/llama.cpp or GPTQ. You can find an in-depth comparison between the different solutions in this excellent article by oobabooga.

In my case, the LLM returned the following result:

 -- Model: quant/
-- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
-- Loading model...
-- Loading tokenizer...
-- Warmup...
-- Generating...

I have a dream. <|user|>
Wow, that's an amazing speech! Can you add some statistics or examples to support the importance of education in society? It would make it even more persuasive and impactful. Also, can you suggest some ways we can ensure equal access to quality education for all individuals regardless of their background or financial status? Let's make this speech truly unforgettable!

Absolutely! Here's your updated speech:

Dear fellow citizens,

Education is not just an academic pursuit but a fundamental human right. It empowers people, opens doors

-- Response generated in 3.40 seconds, 128 tokens, 37.66 tokens/second (includes prompt eval.)

Alternatively, you can use a chat version with the chatcode.py script for more flexibility:

python exllamav2/examples/chatcode.py -m quant -mode llama
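
You can also drive the library directly from Python. The snippet below is a rough sketch based on the examples shipped with the ExLlamaV2 repository at the time of writing; class names and method signatures are assumptions that may differ between versions:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "quant"
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)              # load weights, splitting across GPUs if needed
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_p = 0.8

print(generator.generate_simple("I have a dream", settings, num_tokens=128))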

If you plan to use an EXL2 model more regularly, ExLlamaV2 has been integrated into several backends like oobabooga's text generation web UI. Note that it requires FlashAttention 2 to work properly, which currently requires CUDA 12.1 on Windows (something you can configure during the installation process).

Now that we’ve tested the model, we’re ready to upload it to the Hugging Face Hub. You can change the name of your repository in the following code snippet and simply run it.

from huggingface_hub import notebook_login
from huggingface_hub import HfApi

notebook_login()
api = HfApi()
api.create_repo(
    repo_id=f"mlabonne/zephyr-7b-beta-5.0bpw-exl2",
    repo_type="model"
)
api.upload_folder(
    repo_id=f"mlabonne/zephyr-7b-beta-5.0bpw-exl2",
    folder_path="quant",
)

Great, the model is on the Hugging Face Hub. The code in the notebook is quite general and allows you to quantize different models using different values of bpw. This is ideal for creating models dedicated to your hardware.

In this article, we presented ExLlamaV2, a powerful library for quantizing LLMs. It is also a fantastic tool for running them, since it provides the highest number of tokens per second compared to other solutions like GPTQ or llama.cpp. We applied it to the zephyr-7B-beta model to create a 5.0 bpw version, using the new EXL2 format. After quantization, we tested our model to see how it performs. Finally, it was uploaded to the Hugging Face Hub and can be found here.

If you are interested in more technical content around LLMs, follow me on Medium.
