ANPMI: Assessing the True Comprehension Capabilities of LLMs for Multiple Choice Questions

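Abstract

Multiple-choice benchmarks, which consist of various prompts and choices, are among the most widely used methods for evaluating a language model's natural language understanding.
Given a specific prompt, $P(Choice|Prompt)$ is typically computed to evaluate how likely the language model is to generate the correct choice rather than an incorrect one.
However, we observe that performance measured this way reflects not only the model's understanding of the prompt but also its inherent bias toward certain choices regardless of the prompt.
Because a model may then select an answer without fully understanding the prompt, it becomes difficult to measure its natural language understanding accurately.
To address this limitation, we propose a new metric called ANPMI, which normalizes Pointwise Mutual Information (PMI) by $-\log P(Choice)$.
ANPMI assesses a model's natural language understanding more accurately by ensuring that it is difficult to answer a question without properly understanding the prompt.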

Abstract (original)

Multiple-choice benchmarks, consisting of various prompts and choices, are among the most widely used methods to assess a language model’s natural language understanding capability. Given a specific prompt, we typically compute $P(Choice|Prompt)$ to evaluate how likely a language model is to generate the correct choice compared to incorrect ones. However, we observe that performance measured using this approach reflects not only the model’s comprehension of the prompt but also its inherent biases for certain choices regardless of the prompt. This issue makes it challenging to accurately measure a model’s natural language understanding, as models may select the answer without fully understanding the prompt. To address this limitation, we propose a novel metric called ANPMI, which normalizes Pointwise Mutual Information (PMI) by $-\log P(Choice)$. ANPMI provides a more accurate assessment of the model’s natural language understanding by ensuring that it is challenging to answer a question without properly understanding the prompt.
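As an illustration of the normalization described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes PMI takes its standard form, $\log P(Choice|Prompt) - \log P(Choice)$, and divides it by $-\log P(Choice)$ as the abstract states; the exact definition in the paper may differ. The function names `anpmi` and `pick`, and all probabilities below, are hypothetical. How $P(Choice)$ is estimated is not specified in the abstract, so the sketch simply takes it as an input.

```python
import math


def anpmi(logp_choice_given_prompt: float, logp_choice: float) -> float:
    """PMI normalized by -log P(choice), as described in the abstract.

    Both arguments are natural-log probabilities. This is an assumed
    reading of the abstract, not the paper's reference code.
    """
    if logp_choice >= 0.0:
        raise ValueError("log P(choice) must be negative, i.e. P(choice) < 1")
    pmi = logp_choice_given_prompt - logp_choice  # standard PMI
    return pmi / (-logp_choice)


def pick(scores):
    """Return the index of the highest-scoring choice."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical numbers for one question with three choices.
# Choice 0 is likely even without the prompt (a prompt-independent bias).
logp_cond = [math.log(0.30), math.log(0.25), math.log(0.05)]   # log P(choice | prompt)
logp_prior = [math.log(0.40), math.log(0.10), math.log(0.05)]  # log P(choice)

by_conditional = pick(logp_cond)
by_anpmi = pick([anpmi(c, p) for c, p in zip(logp_cond, logp_prior)])

print("argmax of P(choice | prompt):", by_conditional)
print("argmax of ANPMI:", by_anpmi)
```

With these made-up numbers, scoring by $P(Choice|Prompt)$ alone favors the choice the model already prefers without the prompt, while the normalized score favors the choice whose probability the prompt raises most. Under this reading the score is at most 1 (reached when $P(Choice|Prompt) = 1$) and is 0 when the prompt does not change the choice's probability.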

arXiv information

Authors	Gyeongje Cho, Yeonkyoung So, Jaejin Lee
Published	2025-03-12 16:27:59+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
