Despite the usefulness of large language models (LLMs) across a wide range of tasks and scenarios, researchers struggle to evaluate them properly in different situations. A common workaround is to use LLMs themselves to judge model outputs, but this approach is limited: reliable references are scarce, and it often demands substantial human intervention. Better methods are urgently needed to test how well LLMs can act as evaluators across settings, especially in new, user-defined scenarios.
LLMs have progressed significantly, demonstrating impressive performance across many tasks, yet evaluating their outputs remains challenging. Current approaches rely primarily on automated measurements, often using LLMs themselves as assessors. While a few well-studied tasks have undergone rigorous meta-evaluation backed by expensive human-annotated datasets, most application scenarios lack such scrutiny, leaving the reliability of LLMs as evaluators in question.
Researchers from Shanghai Jiao Tong University, Carnegie Mellon University, Shanghai Artificial Intelligence Laboratory, and the Generative AI Research Lab (GAIR) present ScaleEval, a meta-evaluation framework in which multiple communicative LLM agents debate with one another. The system facilitates multi-round discussions that help human annotators identify the most capable LLMs to serve as evaluators, significantly reducing the annotators' workload in scenarios that traditionally required large amounts of annotation for meta-evaluation.
ScaleEval leverages multi-agent debate for reliable, scalable meta-evaluation of LLMs as evaluators. In the meta-evaluation process, LLM agents engage in multiple rounds of discussion to assess responses against user-defined criteria, which reduces the need for extensive human annotation and keeps the protocol scalable. The framework evaluates pairwise comparisons of responses, focusing on LLMs such as gpt-3.5-turbo. A meta-meta-evaluation, in which human experts apply the agent-assisted annotation protocol themselves, validates the reliability of the proposed method. This design balances efficiency with human judgment to deliver accurate and timely assessments.
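To make the protocol concrete, here is a minimal Python sketch of one agent-debate loop for a pairwise comparison, in the spirit of the framework described above. All names (`ask_agent`, `AGENT_MODELS`, `debate_pairwise`) are hypothetical illustrations rather than ScaleEval's actual API, and the chat-completion call is left as a stub.

```python
# Minimal sketch of an agent-debate meta-evaluation loop, under assumed details.
# Names and prompt wording are hypothetical; ScaleEval's implementation may differ.

AGENT_MODELS = ["gpt-4-turbo", "claude-2", "gpt-3.5-turbo"]  # debating evaluator agents

def ask_agent(model: str, prompt: str) -> str:
    """Placeholder for a chat-completion call to `model`; returns the model's reply."""
    raise NotImplementedError  # wire this to your LLM API client of choice

def debate_pairwise(criteria: str, query: str, resp_a: str, resp_b: str,
                    max_rounds: int = 3) -> str:
    """Agents compare two responses over multiple rounds; returns 'a', 'b', or 'tie'.
    If the agents never reach consensus, the case is escalated to a human annotator."""
    transcript: list[str] = []
    votes: dict[str, str] = {}
    for rnd in range(max_rounds):
        for model in AGENT_MODELS:
            prompt = (
                f"Criteria: {criteria}\nQuery: {query}\n"
                f"Response A: {resp_a}\nResponse B: {resp_b}\n"
                "Discussion so far:\n" + "\n".join(transcript) + "\n"
                "State which response better satisfies the criteria and why. "
                "End with a verdict line: 'Verdict: A', 'Verdict: B', or 'Verdict: tie'."
            )
            reply = ask_agent(model, prompt)
            transcript.append(f"[round {rnd + 1}] {model}: {reply}")
            # Parse the final verdict; if 'Verdict:' is missing, the raw reply is kept.
            votes[model] = reply.rsplit("Verdict:", 1)[-1].strip().rstrip(".").lower()
        if len(set(votes.values())) == 1:   # consensus reached: stop early
            return votes[AGENT_MODELS[0]]
    return "needs_human_annotation"         # persistent disagreement: escalate
```

The split between early exit on consensus and escalation on disagreement is what lets most cases be settled automatically while reserving human judgment for the contentious ones, which is how the framework keeps annotation cost low.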
Experiments reveal that LLMs' performance as evaluators tends to decline when specific words in the criteria prompts are masked, and removing guiding phrases reduces effectiveness further. Gpt-4-turbo and gpt-3.5-turbo show resilience, maintaining consistent agreement rates across criterion formats, whereas Claude-2 displays confusion and reluctance, refusing to answer roughly half of the questions when faced with conflicting prompts. Overall, the tested LLMs struggle to make use of partial or surrogate criteria information, indicating room for improvement in their design and application despite their advanced capabilities.
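For illustration, a masking probe like the one described could be set up as below; the paper's exact masking scheme is not reproduced here, so this is a sketch under assumed details (random word-level masking with `X` placeholders, and a hypothetical `mask_criteria` helper).

```python
# Illustrative sketch of a criteria-masking probe: obscure a fraction of the
# words in the criterion text before handing it to the LLM evaluator.
import random

def mask_criteria(criteria: str, mask_ratio: float = 0.5, seed: int = 0) -> str:
    """Replace a random fraction of the words in `criteria` with 'X' placeholders."""
    rng = random.Random(seed)
    words = criteria.split()
    n_mask = int(len(words) * mask_ratio)
    for i in rng.sample(range(len(words)), n_mask):
        words[i] = "X" * len(words[i])
    return " ".join(words)

print(mask_criteria("Helpfulness: the response directly addresses the user's question."))
```

Comparing an evaluator's agreement rate on the original versus the masked criteria gives a rough measure of how much it actually relies on the criterion wording rather than surface cues.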
In conclusion, the researchers introduced ScaleEval, a scalable, agent-debate-assisted meta-evaluation framework for assessing LLMs as evaluators. The proposal addresses the inefficiency and resource intensity of conventional meta-evaluation methods, a concern that grows as the use of LLMs expands. The study not only validates the reliability of ScaleEval but also highlights the capabilities and limitations of LLMs as evaluators in various scenarios, advancing scalable solutions for evaluating LLMs as their applications multiply.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.