Evaluating Large Language Models at Evaluating Instruction Following

Summary

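As research on large language models (LLMs) continues to accelerate, LLM-based evaluation has emerged as a scalable, cost-effective alternative to human evaluation for comparing the ever-growing list of models.
This paper investigates the effectiveness of these "LLM evaluators," particularly when they are used to assess instruction following, a measure of how closely generated text adheres to a given instruction.
It introduces LLMBar, a challenging meta-evaluation benchmark designed to test how well an LLM evaluator can identify instruction-following outputs.
The authors manually curated 419 pairs of outputs: in each pair, one output adheres to the instruction while the other deviates from it yet may have deceptive qualities, such as a more engaging tone, that can mislead an LLM evaluator.
In contrast to existing meta-evaluations, different evaluators (that is, combinations of an LLM and a prompt) perform differently on LLMBar, and even the highest-scoring evaluators leave substantial room for improvement.
The paper also presents a new suite of prompting strategies that further narrows the gap between LLM and human evaluators.
With LLMBar, the authors hope to offer further insight into LLM evaluators and to encourage future research on better instruction-following models.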
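
To make the pairwise setup concrete, here is a minimal sketch of how an evaluator could be scored on such pairs: accuracy is the fraction of pairs in which the evaluator picks the output that follows the instruction. Everything in it is an assumption made for illustration, not the authors' code or data: the `Pair` record, the `accuracy` helper, the two toy pairs, and the `prefer_longer` stand-in, which plays the role of an evaluator misled by a superficial quality (here, length).

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Pair:
    """One hypothetical benchmark item: an instruction and two candidate outputs."""
    instruction: str
    output_1: str
    output_2: str
    label: int  # 1 or 2: which output actually follows the instruction


# An evaluator maps (instruction, output_1, output_2) to its preferred output: 1 or 2.
Evaluator = Callable[[str, str, str], int]


def accuracy(evaluator: Evaluator, pairs: Sequence[Pair]) -> float:
    """Fraction of pairs where the evaluator prefers the instruction-following output."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = sum(
        evaluator(p.instruction, p.output_1, p.output_2) == p.label for p in pairs
    )
    return correct / len(pairs)


def prefer_longer(instruction: str, output_1: str, output_2: str) -> int:
    """Toy stand-in for a biased evaluator: always prefers the longer output."""
    return 1 if len(output_1) >= len(output_2) else 2


# Invented examples, not taken from LLMBar.
toy_pairs = [
    Pair(
        instruction="Reply with the single word 'yes'.",
        output_1="yes",
        output_2="Absolutely! I would be delighted to say yes to that.",
        label=1,
    ),
    Pair(
        instruction="List three primary colors.",
        output_1="Red.",
        output_2="Red, yellow, and blue.",
        label=2,
    ),
]

score = accuracy(prefer_longer, toy_pairs)
print(f"accuracy: {score:.2f}")
```

In this sketch the length-biased evaluator is fooled by the more elaborate reply in the first pair and happens to be right on the second. A real evaluator would replace `prefer_longer` with a call to an LLM under a specific prompt, which is the combination the summary refers to as an evaluator.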

Abstract (original)

As research in large language models (LLMs) continues to accelerate, LLM-based evaluation has emerged as a scalable and cost-effective alternative to human evaluations for comparing the ever increasing list of models. This paper investigates the efficacy of these ‘LLM evaluators’, particularly in using them to assess instruction following, a metric that gauges how closely generated text adheres to the given instruction. We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs. The authors manually curated 419 pairs of outputs, one adhering to instructions while the other diverging, yet may possess deceptive qualities that mislead an LLM evaluator, e.g., a more engaging tone. Contrary to existing meta-evaluation, we discover that different evaluators (i.e., combinations of LLMs and prompts) exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement. We also present a novel suite of prompting strategies that further close the gap between LLM and human evaluators. With LLMBar, we hope to offer more insight into LLM evaluators and foster future research in developing better instruction-following models.

arXiv information

Authors	Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, Danqi Chen
Published	2023-10-11 16:38:11+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
