DAFE: LLM-Based Evaluation Through Dynamic Arbitration for Free-Form Question-Answering

Summary

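Evaluating free-form responses generated by large language models (LLMs) remains challenging because the responses are diverse and open-ended.
Traditional automatic metrics based on supervised signals cannot capture semantic equivalence or handle the variability of open-ended responses, while human evaluation is reliable but resource-intensive.
Using LLMs as evaluators offers a promising alternative, thanks to their strong language understanding and instruction-following abilities.
Building on these abilities, we propose the Dynamic Arbitration Framework for Evaluation (DAFE), which employs two primary LLM judges and brings in a third arbitrator only when they disagree.
This selective arbitration prioritizes evaluation reliability while reducing unnecessary computation compared with conventional majority voting.
DAFE uses task-specific reference answers together with dynamic arbitration to improve judgment accuracy, yielding significant improvements in evaluation metrics such as Macro F1 and Cohen's Kappa.
Through experiments, including a comprehensive human evaluation, we demonstrate DAFE's ability to provide consistent, scalable, and resource-efficient assessments, establishing it as a robust framework for evaluating free-form model outputs.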
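
The selective arbitration described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes binary correct/incorrect verdicts and that the arbitrator's verdict is final. The names `Judge`, `dafe_verdict`, and the three toy judges are hypothetical stand-ins for real LLM calls.

```python
import re
from typing import Callable, Tuple

# A judge maps (question, reference answer, candidate answer) to a verdict:
# True means "correct", False means "incorrect". In practice each judge would
# wrap an LLM call; the toy judges below are hypothetical stand-ins.
Judge = Callable[[str, str, str], bool]


def dafe_verdict(question: str, reference: str, candidate: str,
                 judge_a: Judge, judge_b: Judge,
                 arbitrator: Judge) -> Tuple[bool, int]:
    """Return (final verdict, number of judge calls made)."""
    a = judge_a(question, reference, candidate)
    b = judge_b(question, reference, candidate)
    if a == b:
        # The primary judges agree, so the arbitrator is never called.
        return a, 2
    # Disagreement: the arbitrator's verdict decides (an assumption here).
    return arbitrator(question, reference, candidate), 3


def exact_judge(question: str, reference: str, candidate: str) -> bool:
    # Strict toy judge: the candidate must equal the reference.
    return candidate.strip().lower() == reference.strip().lower()


def substring_judge(question: str, reference: str, candidate: str) -> bool:
    # Lenient toy judge: the reference must appear somewhere in the candidate.
    return reference.strip().lower() in candidate.lower()


def token_judge(question: str, reference: str, candidate: str) -> bool:
    # Toy arbitrator: every reference word must occur in the candidate.
    ref_tokens = set(re.findall(r"\w+", reference.lower()))
    cand_tokens = set(re.findall(r"\w+", candidate.lower()))
    return ref_tokens <= cand_tokens
```

Under these assumptions, an item on which the two primary judges agree costs two judge calls, whereas majority voting over three judges always costs three. The third call is spent only on disagreements.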

Summary (original)

Evaluating Large Language Models (LLMs) free-form generated responses remains a challenge due to their diverse and open-ended nature. Traditional supervised signal-based automatic metrics fail to capture semantic equivalence or handle the variability of open-ended responses, while human evaluation, though reliable, is resource-intensive. Leveraging LLMs as evaluators offers a promising alternative due to their strong language understanding and instruction-following capabilities. Taking advantage of these capabilities, we propose the Dynamic Arbitration Framework for Evaluation (DAFE), which employs two primary LLM-as-judges and engages a third arbitrator only in cases of disagreements. This selective arbitration prioritizes evaluation reliability while reducing unnecessary computational demands compared to conventional majority voting. DAFE utilizes task-specific reference answers with dynamic arbitration to enhance judgment accuracy, resulting in significant improvements in evaluation metrics such as Macro F1 and Cohen’s Kappa. Through experiments, including a comprehensive human evaluation, we demonstrate DAFE’s ability to provide consistent, scalable, and resource-efficient assessments, establishing it as a robust framework for evaluating free-form model outputs.
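
The two metrics named in the abstract, Macro F1 and Cohen's Kappa, measure how closely a judge's verdicts match human labels. The sketch below computes both in plain Python using their standard definitions. The function names are illustrative and are not taken from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no support scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohens_kappa(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(true_counts[k] * pred_counts[k] for k in true_counts) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is perfect.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Macro F1 weights each class equally regardless of its frequency, while Cohen's Kappa discounts the agreement that two raters would reach by chance alone.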

arXiv information

Authors	Sher Badshah, Hassan Sajjad
Published	2025-03-11 15:29:55+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
