This is a guest post co-written with Michael Feil at Gradient.
Evaluating the performance of large language models (LLMs) is an important step in the pre-training and fine-tuning process before deployment. The faster and more frequently you can validate performance, the better your chances of improving the model.
At Gradient, we work on developing customized LLMs and recently launched our AI Development Lab, offering organizations an end-to-end custom development service to build private, personalized LLMs and artificial intelligence (AI) copilots. As part of this process, we regularly evaluate the performance of our models (tuned, trained, and open) against open and proprietary benchmarks. While working with the AWS team to train our models on AWS Trainium, we realized we were constrained by both VRAM and the availability of GPU instances when it came to the primary tool for LLM evaluation, lm-evaluation-harness. This open source framework lets you evaluate different generative language models across various evaluation tasks and benchmarks. It is used by leaderboards such as Hugging Face's for public benchmarking.
To overcome these challenges, we decided to build and open source our solution, integrating AWS Neuron, the library behind AWS Inferentia and Trainium, into lm-evaluation-harness. This integration made it possible to benchmark v-alpha-tross, an early version of our Albatross model, against other public models during and after the training process.
For context, this integration functions as a new model class in lm-evaluation-harness, abstracting the inference of tokens and the log-likelihood estimation of sequences without affecting the actual evaluation task. The decision to move our internal testing pipeline to Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instances (powered by AWS Inferentia2) gave us access to up to 384 GB of shared accelerator memory, scaling effortlessly across all of our current public architectures. By using AWS Spot Instances, we were able to take advantage of unused EC2 capacity in the AWS Cloud, for savings of up to 90% compared to On-Demand prices. This minimized the time needed for testing and let us test more frequently, because we could test across multiple readily available instances and release the instances when we finished.
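At a high level, the integration looks like the following sketch: a model class registered with the harness that implements the inference primitives, while the evaluation tasks themselves remain untouched. This is an illustration only (the class name is hypothetical, and the registry imports assume a v0.4-style lm-evaluation-harness), not the actual upstream code:

```python
# Illustrative sketch only; the real integration lives in lm-evaluation-harness.
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model


@register_model("hf-neuron")
class NeuronCausalLM(LM):  # hypothetical name for illustration
    """Backs the standard evaluation tasks with AWS Neuron inference."""

    def loglikelihood(self, requests):
        # Score (context, continuation) pairs with the Neuron model;
        # the evaluation tasks themselves never change.
        ...

    def loglikelihood_rolling(self, requests):
        # Perplexity-style scoring over whole sequences.
        ...

    def generate_until(self, requests):
        # Free-form generation until a stop sequence, as used by gsm8k.
        ...
```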
In this article, we give a detailed description of our testing, the challenges we encountered, and an example of using the test harness on AWS Inferentia.
Benchmarking on AWS Inferentia2
The goal of this project was to generate scores identical to those shown on the Open LLM Leaderboard (for the many CausalLM models available on Hugging Face), while retaining the flexibility to run against private benchmarks. To see more examples of available models, see AWS Inferentia and Trainium on Hugging Face.
The code changes required to port a model from the Hugging Face transformers library to the Hugging Face Optimum Neuron Python library were minimal. Because lm-evaluation-harness uses AutoModelForCausalLM, there is a drop-in replacement using NeuronModelForCausalLM. Without a precompiled model, the model is automatically compiled on the fly, which can add 15–60 minutes to a job. This gave us the flexibility to deploy tests for any AWS Inferentia2 instance and any supported CausalLM model.
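To illustrate how small the change is, here is a minimal sketch assuming the optimum-neuron package is installed on an Inferentia2 instance; the compilation settings (batch size, sequence length, core count, cast type) are illustrative values, not our exact configuration:

```python
from transformers import AutoTokenizer
# from transformers import AutoModelForCausalLM   # the usual class ...
from optimum.neuron import NeuronModelForCausalLM  # ... and its drop-in replacement

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True compiles the model on the fly if no precompiled
# artifact exists; the settings below are illustrative.
model = NeuronModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    batch_size=1,
    sequence_length=2048,
    num_cores=2,
    auto_cast_type="bf16",
)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```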
Results
Because of the way benchmarks and models work, we didn't expect the scores to match exactly across runs. However, they should be very close based on standard deviation, and we have consistently seen this, as shown in the following table. The initial tests we ran on AWS Inferentia2 were all confirmed against the Hugging Face leaderboard.
In lm-evaluation-harness, there are two main flows used by different tests: generate_until and loglikelihood. The gsm8k test primarily uses generate_until to produce answers, just as it would during inference. loglikelihood is mainly used in benchmarks and tests that examine the likelihood of different outputs being produced. Both work in Neuron, but the loglikelihood method in SDK 2.16 uses additional steps to determine the probabilities and can take extra time.
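For intuition, the loglikelihood flow reduces to a single forward pass over the context plus the candidate continuation, summing the log-probabilities the model assigns to each continuation token. Here is a framework-agnostic sketch in plain PyTorch for an HF-style causal LM, not the harness's actual Neuron implementation:

```python
import torch
import torch.nn.functional as F


def continuation_loglikelihood(model, context_ids, continuation_ids):
    """Sum of log P(continuation token | preceding tokens) under the model.

    context_ids, continuation_ids: 1-D tensors of token IDs.
    Assumes a Hugging Face-style model whose forward pass returns logits.
    """
    input_ids = torch.cat([context_ids, continuation_ids]).unsqueeze(0)
    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab)
    # Logits at position i predict token i+1, so shift by one.
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    # Keep only the positions whose targets are continuation tokens.
    cont_start = context_ids.shape[0] - 1
    cont_log_probs = log_probs[cont_start:].gather(
        1, targets[cont_start:].unsqueeze(1)
    ).squeeze(1)
    return cont_log_probs.sum().item()
```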
lm-evaluation-harness results

| Hardware configuration | Original system | AWS Inferentia2 inf2.48xlarge |
| --- | --- | --- |
| Time with batch_size=1 to evaluate mistralai/Mistral-7B-Instruct-v0.1 on gsm8k | 103 minutes | 32 minutes |
| Score on gsm8k (get-answer – exact_match with std) | 0.3813 – 0.3874 (± 0.0134) | 0.3806 – 0.3844 (± 0.0134) |
Getting started with Neuron and lm-evaluation-harness
The code in this section can help you use lm-evaluation-harness and run it against supported models on Hugging Face. To see some available models, visit AWS Inferentia and Trainium on Hugging Face.
If you are used to running models on AWS Inferentia2, you may notice that no num_cores parameter is passed. Our code detects the number of available NeuronCores and passes that number in automatically. This lets you run the test using the same code, regardless of the size of the instance you are using. You might also notice that we refer to the original model, not a Neuron-compiled version; the harness compiles the model for you as needed.
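Our detection logic boils down to asking the Neuron SDK how many cores are present. The sketch below illustrates the idea using the neuron-ls tool's JSON output; the exact field names (such as nc_count) are assumptions about the tool's output format and may differ across SDK versions:

```python
import json
import subprocess


def detect_neuron_cores() -> int:
    """Best-effort count of NeuronCores on the current instance.

    Sketch only: parses `neuron-ls --json-output`, which lists one entry
    per Neuron device. The nc_count field name is an assumption; each
    Inferentia2 device exposes two NeuronCores, hence the fallback.
    """
    output = subprocess.check_output(["neuron-ls", "--json-output"])
    devices = json.loads(output)
    return sum(device.get("nc_count", 2) for device in devices)
```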
The following steps show you how to deploy the Gradient gradientai/v-alpha-tross model that we tested. If you want to test with a smaller example on a smaller instance, you can use the mistralai/Mistral-7B-v0.1 model.
- The default quota for running On-Demand Inf instances is 0, so you need to request an increase through Service Quotas. Add a second request for All Inf Spot Instance Requests so you can also test with Spot Instances. You need a quota of 192 vCPUs for this example using an inf2.48xlarge instance, or of 4 vCPUs for a base inf2.xlarge (if you are deploying the Mistral model). Quotas are specific to an AWS Region, so be sure to request them in us-east-1 or us-west-2.
- Choose your instance based on your model. Because v-alpha-tross is a 70B architecture, we decided to use an inf2.48xlarge instance. Deploy an inf2.xlarge if you are running the 7B Mistral model. If you are testing a different model, you may need to adjust your instance based on the size of your model.
- Deploy the instance using the Hugging Face DLAMI version 20240123, so that all the necessary drivers are installed. (The price shown includes the cost of the instance; there are no additional software charges.)
- Adjust the disk size to 600 GB (100 GB for Mistral 7B).
- Clone and install lm-evaluation-harness on the instance. We pin a specific version so that any variance is due to changes in the model, not to changes in the tests or the harness code.
- Run lm_eval with the hf-neuron model type and make sure you have a link to the path of the model on Hugging Face, as in the sketch after this list:
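As a hedged illustration, the following sketch shows an equivalent programmatic invocation through the harness's Python API (assuming a v0.4-style lm-evaluation-harness and the hf-neuron model type described in this post), rather than the exact CLI command from our pipeline:

```python
# Sketch: programmatic equivalent of the lm_eval run described above.
# Assumes lm-evaluation-harness (v0.4.x) and its Neuron backend are installed.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf-neuron",                                 # the Neuron-backed model type
    model_args="pretrained=gradientai/v-alpha-tross",  # or mistralai/Mistral-7B-v0.1
    tasks=["gsm8k"],
    batch_size=1,
)
print(results["results"]["gsm8k"])
```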
If you run the previous example with Mistral, you should see similar output (on the smaller inf2.xlarge, the run can take around 250 minutes).
Clean up
When you're done, make sure to shut down the EC2 instances through the Amazon EC2 console.
Conclusion
The Gradient and Neuron teams are excited to see broader adoption of LLM evaluation with this release. Try it yourself and run the most popular benchmarking framework on AWS Inferentia2 instances. You can now benefit from the on-demand availability of AWS Inferentia2 when you use Gradient's custom LLM development services. Get started hosting models on AWS Inferentia with these tutorials.
About the authors
Michael Feil is an AI engineer at Gradient and previously worked as an ML engineer at Rohde & Schwarz and a researcher at the Max Planck Institute for Intelligent Systems and Bosch Rexroth. Michael is a leading contributor to various open source inference libraries for LLMs and open source projects such as StarCoder. Michael holds a bachelor's degree in mechatronics and computer science from KIT and a master's degree in robotics from the Technical University of Munich.
Jim Burtoft is a Senior Startup Solutions Architect at AWS and works directly with startups like Gradient. Jim is a CISSP, part of the AWS AI/ML technical community, is a Neuron Ambassador, and works with the open source community to enable the use of Inferentia and Trainium. Jim holds a bachelor's degree in mathematics from Carnegie Mellon University and a master's degree in economics from the University of Virginia.