PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations

Summary

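Automatically evaluating and comparing the quality of responses generated by modern large language models (LLMs) is currently difficult.
Recent studies propose, and predominantly use, LLMs for reference-free evaluation of open-ended question answering.
More specifically, they take the LLM recognized as the "strongest" as the evaluator, which compares candidate models' answers pairwise and provides a ranking score.
However, this intuitive method has several problems, such as introducing self-enhancement (favoring its own answers) and positional bias.
We draw insights and lessons from the educational domain (Cho & MacArthur, 2011; Walsh, 2014) to improve LLM-based evaluations.
Specifically, we propose (1) the peer rank (PR) algorithm, which takes into account each peer LLM's pairwise preferences over all answer pairs and outputs a final ranking of models; and
(2) peer discussion (PD), in which we prompt two LLMs to discuss and try to reach a mutual agreement on their preferences between two answers.
We conduct experiments on two benchmark datasets.
We find that our approaches achieve higher accuracy and align better with human judgments.
Interestingly, PR can induce a relatively accurate self-ranking of models under the anonymous setting, where each model's name is not revealed.
Our work provides room to explore the evaluation of models that are hard for humans to compare.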
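
To make the peer rank (PR) idea more concrete, here is a minimal Python sketch. The abstract only says that PR combines every peer LLM's pairwise preferences into a final ranking; the specific scheme below (each reviewer's vote is weighted by that reviewer's own current score, and scores are recomputed for a fixed number of rounds) is an assumption made for illustration, and the names `peer_rank`, `prefs`, and `iterations` are hypothetical.

```python
def peer_rank(models, prefs, iterations=50):
    """Rank models from peer pairwise preferences (illustrative sketch).

    prefs[reviewer][(a, b)] is the model (a or b) whose answer `reviewer`
    preferred. Every reviewer is assumed to be one of `models`.
    """
    # Start with every reviewer trusted equally.
    weights = {m: 1.0 / len(models) for m in models}
    scores = dict(weights)
    for _ in range(iterations):
        wins = {m: 0.0 for m in models}
        for reviewer, judgments in prefs.items():
            for winner in judgments.values():
                # A vote counts in proportion to the reviewer's weight.
                wins[winner] += weights[reviewer]
        total = sum(wins.values())
        if total == 0:
            break  # no usable votes; keep the previous scores
        scores = {m: wins[m] / total for m in models}
        # Assumption: a reviewer's weight is its own normalized score.
        weights = dict(scores)
    ranking = sorted(models, key=lambda m: scores[m], reverse=True)
    return ranking, scores


# Toy data: A and B judge consistently, C always favors its own answers.
models = ["A", "B", "C"]
prefs = {
    "A": {("A", "B"): "A", ("A", "C"): "A", ("B", "C"): "B"},
    "B": {("A", "B"): "A", ("A", "C"): "A", ("B", "C"): "B"},
    "C": {("A", "B"): "A", ("A", "C"): "C", ("B", "C"): "C"},
}
ranking, scores = peer_rank(models, prefs)
print(ranking)
```

In this toy setup the self-favoring reviewer C gradually loses influence, because its votes for itself are outweighed by the other reviewers. This is one way a peer-based scheme could dampen the self-enhancement problem mentioned above.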

Summary (original)

Nowadays, the quality of responses generated by different modern large language models (LLMs) is hard to evaluate and compare automatically. Recent studies suggest and predominantly use LLMs for reference-free evaluation of open-ended question answering. More specifically, they use the recognized ‘strongest’ LLM as the evaluator, which conducts pairwise comparisons of candidate models’ answers and provides a ranking score. However, this intuitive method has multiple problems, such as bringing in self-enhancement (favoring its own answers) and positional bias. We draw insights and lessons from the educational domain (Cho & MacArthur, 2011; Walsh, 2014) to improve LLM-based evaluations. Specifically, we propose (1) the peer rank (PR) algorithm that takes into account each peer LLM’s pairwise preferences of all answer pairs, and outputs a final ranking of models; and (2) peer discussion (PD), where we prompt two LLMs to discuss and try to reach a mutual agreement on the preferences of two answers. We conduct experiments on two benchmark datasets. We find that our approaches achieve higher accuracy and align better with human judgments. Interestingly, PR can induce a relatively accurate self-ranking of models under the anonymous setting, where each model’s name is unrevealed. Our work provides space to explore evaluating models that are hard to compare for humans.
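
As a rough illustration of the peer discussion (PD) idea in the abstract above, the following Python sketch lets two reviewers take turns until their preferences agree. The abstract does not specify prompts, turn limits, or stopping rules, so everything here is an assumption: the `Reviewer` callable interface, the `max_turns` cap, and the stub reviewers `stubborn` and `open_minded`, which stand in for real LLM calls.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: a reviewer receives the question, both answers, and
# the discussion so far, and returns (preferred answer: 1 or 2, argument).
Reviewer = Callable[[str, str, str, List[str]], Tuple[int, str]]


def peer_discussion(
    question: str,
    answer_1: str,
    answer_2: str,
    reviewer_a: Reviewer,
    reviewer_b: Reviewer,
    max_turns: int = 4,
) -> Tuple[Optional[int], List[str]]:
    """Alternate between two reviewers until their latest preferences match."""
    history: List[str] = []
    latest = {}  # most recent preference stated by each reviewer
    speakers = [("A", reviewer_a), ("B", reviewer_b)]
    for turn in range(max_turns):
        name, reviewer = speakers[turn % 2]
        preference, argument = reviewer(question, answer_1, answer_2, history)
        history.append(f"{name}: prefers answer {preference}. {argument}")
        latest[name] = preference
        if len(latest) == 2 and latest["A"] == latest["B"]:
            return preference, history  # mutual agreement reached
    return None, history  # no agreement within the turn limit


# Stub reviewers standing in for LLM calls (illustrative only).
def stubborn(question, a1, a2, history):
    return 1, "Answer 1 addresses the question directly."


def open_minded(question, a1, a2, history):
    if history:  # changes its mind once it has read an argument
        return 1, "I agree after reading the previous argument."
    return 2, "Answer 2 is more detailed."


verdict, history = peer_discussion(
    "What causes tides?", "The Moon's gravity.", "Ocean winds.",
    open_minded, stubborn,
)
print(verdict, len(history))
```

Returning `None` when the reviewers never converge keeps disagreement visible to the caller instead of forcing a verdict. How such cases should be resolved is left open here.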

arXiv information

Authors	Ruosen Li, Teerth Patel, Xinya Du
Published	2024-12-31 06:54:19+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
