This is a guest post co-written with Michael Feil at Gradient.
Evaluating the performance of large language models (LLMs) is an important step in the pre-training and fine-tuning process before deployment. The faster and more frequently you can validate performance, the better your chances of improving the model.
At Gradient, we work on developing customized LLMs and recently launched our AI Development Lab, offering organizations an end-to-end custom development service to build private, personalized LLMs and artificial intelligence (AI) copilots. As part of this process, we regularly evaluate the performance of our models (tuned, trained, and open) against open and proprietary benchmarks. While working with the AWS team to train our models on AWS Trainium, we realized we were constrained by both VRAM and the availability of GPU instances when it came to the primary tool for LLM evaluation, lm-evaluation-harness. This open source framework lets you evaluate different generative language models across various evaluation tasks and benchmarks. It is used by leaderboards such as Hugging Face's for public benchmarking.
To overcome these challenges, we decided to build and open source our solution, integrating AWS Neuron, the library behind AWS Inferentia and Trainium, into lm-evaluation-harness. This integration made it possible to benchmark v-alpha-tross, an early version of our Albatross model, against other public models during and after the training process.
For context, this integration functions as a new model class in lm-evaluation-harness, abstracting the inference of tokens and the log-likelihood estimation of sequences without affecting the actual evaluation task. The decision to move our internal testing pipeline to Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instances (powered by AWS Inferentia2) gave us access to up to 384 GB of shared accelerator memory, scaling effortlessly across all of our current public architectures. By using AWS Spot Instances, we were able to take advantage of unused EC2 capacity in the AWS Cloud, for savings of up to 90% compared to On-Demand prices. This minimized the time needed for testing and let us test more frequently, because we could test across multiple readily available instances and release the instances when we finished.
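At a high level, the integration looks like the following sketch: a model class registered with the harness that implements the inference primitives, while the evaluation tasks themselves remain untouched. This is an illustration only (the class name is hypothetical, and the registry imports assume a v0.4-style lm-evaluation-harness), not the actual upstream code:

```python
# Illustrative sketch only; the real integration lives in lm-evaluation-harness.
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model


@register_model("hf-neuron")
class NeuronCausalLM(LM):  # hypothetical name for illustration
    """Backs the standard evaluation tasks with AWS Neuron inference."""

    def loglikelihood(self, requests):
        # Score (context, continuation) pairs with the Neuron model;
        # the evaluation tasks themselves never change.
        ...

    def loglikelihood_rolling(self, requests):
        # Perplexity-style scoring over whole sequences.
        ...

    def generate_until(self, requests):
        # Free-form generation until a stop sequence, as used by gsm8k.
        ...
```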
In this article, we give a detailed description of our testing, the challenges we encountered, and an example of using the test harness on AWS Inferentia.
Benchmarking on AWS Inferentia2
The goal of this project was to generate scores identical to those shown on the Open LLM Leaderboard (for the many CausalLM models available on Hugging Face), while retaining the flexibility to run against private benchmarks. To see more examples of available models, see AWS Inferentia and Trainium on Hugging Face.
The code changes required to port a model from the Hugging Face transformers library to the Hugging Face Optimum Neuron Python library were minimal. Because lm-evaluation-harness uses AutoModelForCausalLM, there is a drop-in replacement using NeuronModelForCausalLM. Without a precompiled model, the model is automatically compiled on the fly, which can add 15–60 minutes to a job. This gave us the flexibility to deploy tests for any AWS Inferentia2 instance and any supported CausalLM model.
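To illustrate how small the change is, here is a minimal sketch assuming the optimum-neuron package is installed on an Inferentia2 instance; the compilation settings (batch size, sequence length, core count, cast type) are illustrative values, not our exact configuration:

```python
from transformers import AutoTokenizer
# from transformers import AutoModelForCausalLM   # the usual class ...
from optimum.neuron import NeuronModelForCausalLM  # ... and its drop-in replacement

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True compiles the model on the fly if no precompiled
# artifact exists; the settings below are illustrative.
model = NeuronModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    batch_size=1,
    sequence_length=2048,
    num_cores=2,
    auto_cast_type="bf16",
)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```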
Results
Because of the way benchmarks and models work, we didn't expect the scores to match exactly across runs. However, they should be very close based on standard deviation, and we have consistently seen this, as shown in the following table. The initial tests we ran on AWS Inferentia2 were all confirmed against the Hugging Face leaderboard.
In lm-evaluation-harness, there are two main flows used by different tests: generate_until and loglikelihood. The gsm8k test primarily uses generate_until to produce answers, just as it would during inference. loglikelihood is mainly used in benchmarks and tests that examine the likelihood of different outputs being produced. Both work in Neuron, but the loglikelihood method in SDK 2.16 uses additional steps to determine the probabilities and can take extra time.
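For intuition, the loglikelihood flow reduces to a single forward pass over the context plus the candidate continuation, summing the log-probabilities the model assigns to each continuation token. Here is a framework-agnostic sketch in plain PyTorch for an HF-style causal LM, not the harness's actual Neuron implementation:

```python
import torch
import torch.nn.functional as F


def continuation_loglikelihood(model, context_ids, continuation_ids):
    """Sum of log P(continuation token | preceding tokens) under the model.

    context_ids, continuation_ids: 1-D tensors of token IDs.
    Assumes a Hugging Face-style model whose forward pass returns logits.
    """
    input_ids = torch.cat([context_ids, continuation_ids]).unsqueeze(0)
    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab)
    # Logits at position i predict token i+1, so shift by one.
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    # Keep only the positions whose targets are continuation tokens.
    cont_start = context_ids.shape[0] - 1
    cont_log_probs = log_probs[cont_start:].gather(
        1, targets[cont_start:].unsqueeze(1)
    ).squeeze(1)
    return cont_log_probs.sum().item()
```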
lm-evaluation-harness results

| Hardware configuration | Original system | AWS Inferentia2 inf2.48xlarge |
| --- | --- | --- |
| Time with batch_size=1 to evaluate mistralai/Mistral-7B-Instruct-v0.1 on gsm8k | 103 minutes | 32 minutes |
| Score on gsm8k (get-answer – exact_match with std) | 0.3813 – 0.3874 (± 0.0134) | 0.3806 – 0.3844 (± 0.0134) |
Getting started with Neuron and lm-evaluation-harness
The code in this section can help you use lm-evaluation-harness and run it against supported models on Hugging Face. To see some available models, visit AWS Inferentia and Trainium on Hugging Face.
If you are used to running models on AWS Inferentia2, you may notice that no num_cores parameter is passed. Our code detects the number of available NeuronCores and passes that number in automatically. This lets you run the test using the same code, regardless of the size of the instance you are using. You might also notice that we refer to the original model, not a Neuron-compiled version; the harness compiles the model for you as needed.
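Our detection logic boils down to asking the Neuron SDK how many cores are present. The sketch below illustrates the idea using the neuron-ls tool's JSON output; the exact field names (such as nc_count) are assumptions about the tool's output format and may differ across SDK versions:

```python
import json
import subprocess


def detect_neuron_cores() -> int:
    """Best-effort count of NeuronCores on the current instance.

    Sketch only: parses `neuron-ls --json-output`, which lists one entry
    per Neuron device. The nc_count field name is an assumption; each
    Inferentia2 device exposes two NeuronCores, hence the fallback.
    """
    output = subprocess.check_output(["neuron-ls", "--json-output"])
    devices = json.loads(output)
    return sum(device.get("nc_count", 2) for device in devices)
```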
The following steps show you how to deploy the Gradient gradientai/v-alpha-tross model that we tested. If you want to test with a smaller example on a smaller instance, you can use the mistralai/Mistral-7B-v0.1 model.
- The default quota for running On-Demand Inf instances is 0, so you need to request an increase through Service Quotas. Add a second request for All Inf Spot Instance Requests so you can also test with Spot Instances. You need a quota of 192 vCPUs for this example using an inf2.48xlarge instance, or of 4 vCPUs for a base inf2.xlarge (if you are deploying the Mistral model). Quotas are specific to an AWS Region, so be sure to request them in us-east-1 or us-west-2.
- Choose your instance based on your model. Because v-alpha-tross is a 70B architecture, we decided to use an inf2.48xlarge instance. Deploy an inf2.xlarge if you are running the 7B Mistral model. If you are testing a different model, you may need to adjust your instance based on the size of your model.
- Deploy the instance using the Hugging Face DLAMI version 20240123, so that all the necessary drivers are installed. (The price shown includes the cost of the instance; there are no additional software charges.)
- Adjust the disk size to 600 GB (100 GB for Mistral 7B).
- Clone and install lm-evaluation-harness on the instance. We pin a specific version so that any variance is due to changes in the model, not to changes in the tests or the harness code.
- Run lm_eval with the hf-neuron model type and make sure you have a link to the path of the model on Hugging Face, as in the sketch after this list:
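As a hedged illustration, the following sketch shows an equivalent programmatic invocation through the harness's Python API (assuming a v0.4-style lm-evaluation-harness and the hf-neuron model type described in this post), rather than the exact CLI command from our pipeline:

```python
# Sketch: programmatic equivalent of the lm_eval run described above.
# Assumes lm-evaluation-harness (v0.4.x) and its Neuron backend are installed.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf-neuron",                                 # the Neuron-backed model type
    model_args="pretrained=gradientai/v-alpha-tross",  # or mistralai/Mistral-7B-v0.1
    tasks=["gsm8k"],
    batch_size=1,
)
print(results["results"]["gsm8k"])
```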
If you run the previous example with Mistral, you should see similar output (on the smaller inf2.xlarge, the run can take around 250 minutes).
Clean up
When you're done, make sure to shut down the EC2 instances through the Amazon EC2 console.
Conclusion
The Gradient and Neuron teams are excited to see broader adoption of LLM evaluation with this release. Try it yourself and run the most popular benchmarking framework on AWS Inferentia2 instances. You can now benefit from the on-demand availability of AWS Inferentia2 when you use Gradient's custom LLM development services. Get started hosting models on AWS Inferentia with these tutorials.
About the authors
Michael Feil is an AI engineer at Gradient and previously worked as an ML engineer at Rohde & Schwarz and a researcher at the Max Planck Institute for Intelligent Systems and Bosch Rexroth. Michael is a leading contributor to various open source inference libraries for LLMs and open source projects such as StarCoder. Michael holds a bachelor's degree in mechatronics and computer science from KIT and a master's degree in robotics from the Technical University of Munich.
Jim Burtoft is a Senior Startup Solutions Architect at AWS and works directly with startups like Gradient. Jim is a CISSP, part of the AWS AI/ML technical community, is a Neuron Ambassador, and works with the open source community to enable the use of Inferentia and Trainium. Jim holds a bachelor's degree in mathematics from Carnegie Mellon University and a master's degree in economics from the University of Virginia.