Can Many-Shot In-Context Learning Help LLMs as Evaluators? A Preliminary Empirical Study

Summary

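Using large language models (LLMs) as evaluators of LLM performance has recently attracted attention.
However, this evaluation approach is affected by potential biases in LLMs, which raises concerns about the accuracy and reliability of the evaluation results.
To mitigate this issue, we propose and study two many-shot in-context learning (ICL) prompts, built on two versions of a many-shot ICL prompt template, that help LLM evaluators reduce these potential biases: Many-Shot with Reference (MSwR) and Many-Shot without Reference (MSoR).
Concretely, the former uses in-context examples that include model-generated rationales as guidance, while the latter does not.
Based on the designed prompts, we investigate how scaling the number of in-context examples affects the consistency and quality of the evaluation results.
Experimental results show that advanced LLMs, such as GPT-4o, perform better in the many-shot regime than in the zero-shot regime.
Furthermore, we reveal the symbol bias hidden in the selection bias of LLMs and propose a simple yet effective approach to mitigate it.
Experimental results further verify the effectiveness of the symbol bias mitigation approach.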
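
The difference between the two prompt variants can be illustrated with a minimal Python sketch. The summary does not give the actual templates, so everything here is an assumption made for illustration: the `Example` fields, the `build_prompt` function, the instruction wording, and the pairwise A/B format are all hypothetical. The only detail taken from the summary is that MSwR includes model-generated rationales in the in-context examples and MSoR omits them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    """One in-context example (hypothetical layout, not the paper's)."""
    question: str
    answer_a: str
    answer_b: str
    verdict: str                      # e.g. "A" or "B"
    rationale: Optional[str] = None   # model-generated rationale, used by MSwR only


def build_prompt(examples: List[Example], question: str, answer_a: str,
                 answer_b: str, with_reference: bool) -> str:
    """Assemble a many-shot evaluation prompt.

    with_reference=True  -> MSwR-style: rationales are shown in each example.
    with_reference=False -> MSoR-style: rationales are left out.
    """
    # Assumed instruction text; the real template is not given in the summary.
    parts = ["Decide which answer is better. Reply with A or B."]
    for ex in examples:
        block = [
            f"Question: {ex.question}",
            f"Answer A: {ex.answer_a}",
            f"Answer B: {ex.answer_b}",
        ]
        if with_reference and ex.rationale:
            block.append(f"Rationale: {ex.rationale}")
        block.append(f"Verdict: {ex.verdict}")
        parts.append("\n".join(block))
    # The item to be judged comes last, with the verdict left open.
    parts.append("\n".join([
        f"Question: {question}",
        f"Answer A: {answer_a}",
        f"Answer B: {answer_b}",
        "Verdict:",
    ]))
    return "\n\n".join(parts)
```

Scaling the number of in-context examples then amounts to passing a longer `examples` list, which is the variable the study reports varying.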

Summary (Original)

Utilizing Large Language Models (LLMs) as evaluators for evaluating the performance of LLMs has recently garnered attention. However, this kind of evaluation approach is affected by potential biases in LLMs, raising concerns about the accuracy and reliability of the evaluation results. To mitigate this issue, we propose and study two many-shot ICL prompts, which rely on two versions of many-shot ICL prompt templates for helping LLM evaluators to mitigate the potential biases in LLMs, Many-Shot with Reference (MSwR) and Many-Shot without Reference (MSoR). Concretely, the former utilizes in-context examples with model-generated rationales as guidance, and the latter without. Based on the designed prompts, we investigate the impact of scaling the number of in-context examples on the consistency and quality of the evaluation results. Experimental results show that advanced LLMs, such as GPT-4o, perform better in the many-shot regime than in the zero-shot regime. Furthermore, we reveal the symbol bias hidden in the selection bias of LLMs and propose a simple yet effective approach to mitigate the bias. Experimental results further verify the effectiveness of the symbol bias mitigation approach.

arXiv Information

Authors	Mingyang Song, Mao Zheng, Xuan Luo
Published	2024-09-17 14:04:27+00:00
arXiv site	arxiv_id(pdf)

Source and Services Used

arxiv.jp, Google
