Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks

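Summary

As large language models (LLMs) continue to evolve, evaluating them remains a persistent challenge.
Many recent evaluations use LLMs as judges to score the outputs of other LLMs, often relying on a single large model such as GPT-4o.
However, a single LLM judge is prone to intra-model bias, and many tasks, such as those involving emotional intelligence, creative writing, and persuasiveness, may be too subjective for a single model to judge fairly.
We introduce the Language Model Council (LMC), in which a group of LLMs collaborate to create tests, respond to them, and evaluate each other's responses to produce a ranking in a democratic fashion.
Unlike previous approaches that focus on reducing cost or bias by using a panel of smaller models, our work examines the benefits and nuances of a fully inclusive LLM evaluation system.
In a detailed case study on emotional intelligence, we deploy a council of 20 recent LLMs to rank each other on open-ended responses to interpersonal conflicts.
Our results show that the LMC produces rankings that are more separable and more robust, and a user study shows that they are more consistent with human evaluations than those of any individual LLM judge.
Because using every LLM as a judge can be costly, we use Monte Carlo simulations and hand-curated sub-councils to study hypothetical council compositions and discuss the value of the incremental LLM judge.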

Abstract (original)

As Large Language Models (LLMs) continue to evolve, evaluating them remains a persistent challenge. Many recent evaluations use LLMs as judges to score outputs from other LLMs, often relying on a single large model like GPT-4o. However, using a single LLM judge is prone to intra-model bias, and many tasks – such as those related to emotional intelligence, creative writing, and persuasiveness – may be too subjective for a single model to judge fairly. We introduce the Language Model Council (LMC), where a group of LLMs collaborate to create tests, respond to them, and evaluate each other’s responses to produce a ranking in a democratic fashion. Unlike previous approaches that focus on reducing cost or bias by using a panel of smaller models, our work examines the benefits and nuances of a fully inclusive LLM evaluation system. In a detailed case study on emotional intelligence, we deploy a council of 20 recent LLMs to rank each other on open-ended responses to interpersonal conflicts. Our results show that the LMC produces rankings that are more separable and more robust, and through a user study, we show that they are more consistent with human evaluations than any individual LLM judge. Using all LLMs for judging can be costly, however, so we use Monte Carlo simulations and hand-curated sub-councils to study hypothetical council compositions and discuss the value of the incremental LLM judge.
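The abstract's two mechanical ideas, pooling many judges' verdicts into one ranking and using Monte Carlo sampling to study smaller councils, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the vote tuple format, the function names `council_ranking` and `subcouncil_agreement`, the toy data, and the use of a plain win rate as the score. The abstract does not specify the scoring method, so this is not the authors' implementation.

```python
import random
from collections import defaultdict

def council_ranking(votes, judges=None):
    """Rank models by win rate, pooling votes from the selected judges.

    votes: list of (judge, model_a, model_b, winner) tuples (hypothetical format).
    judges: optional set of judge names; None means the full council.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for judge, a, b, winner in votes:
        if judges is not None and judge not in judges:
            continue
        games[a] += 1
        games[b] += 1
        wins[winner] += 1
    scores = {m: wins[m] / games[m] for m in games}
    # Highest win rate first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda m: (-scores[m], m))

def subcouncil_agreement(votes, size, trials=200, seed=0):
    """Monte Carlo estimate of how often a random sub-council of `size`
    judges reproduces the full council's ranking exactly."""
    rng = random.Random(seed)
    all_judges = sorted({judge for judge, _, _, _ in votes})
    full = council_ranking(votes)
    matches = 0
    for _ in range(trials):
        sub = set(rng.sample(all_judges, size))
        if council_ranking(votes, sub) == full:
            matches += 1
    return matches / trials

# Toy data (invented): three judges compare models A, B, C pairwise.
# Judge j3 disagrees with the others on B versus C.
votes = []
for judge in ("j1", "j2", "j3"):
    votes.append((judge, "A", "B", "A"))
    votes.append((judge, "A", "C", "A"))
    votes.append((judge, "B", "C", "C" if judge == "j3" else "B"))

print(council_ranking(votes))
print(subcouncil_agreement(votes, size=1))
```

In this toy setup a single dissenting judge changes the ranking when judging alone but is outvoted in the pooled council, which is the intuition behind studying how much each incremental judge adds.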

arXiv information

Authors	Justin Zhao, Flor Miriam Plaza-del-Arco, Benjie Genchel, Amanda Cercas Curry
Published	2025-02-11 18:42:44+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
