Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias Results

Summary

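Uncertainty quantification (UQ) in language models (LMs) is important for improving their safety and reliability.
Evaluations often use performance metrics such as AUROC to assess how well UQ methods (e.g., negative sequence probabilities) correlate with task correctness functions (e.g., ROUGE-L).
This paper shows that commonly used correctness functions bias UQ evaluations by inflating the performance of certain UQ methods.
It evaluates 7 correctness functions, ranging from lexical-based and embedding-based metrics to LLM-as-a-judge approaches, across 4 datasets x 4 models x 6 UQ methods.
The analysis reveals that length biases in the errors of these correctness functions distort UQ assessments by interacting with length biases in the UQ methods.
LLM-as-a-judge approaches are identified as among the least length-biased choices and hence a potential solution for mitigating these biases.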

Summary (original)

Uncertainty Quantification (UQ) in Language Models (LMs) is crucial for improving their safety and reliability. Evaluations often use performance metrics like AUROC to assess how well UQ methods (e.g., negative sequence probabilities) correlate with task correctness functions (e.g., ROUGE-L). In this paper, we show that commonly used correctness functions bias UQ evaluations by inflating the performance of certain UQ methods. We evaluate 7 correctness functions — from lexical-based and embedding-based metrics to LLM-as-a-judge approaches — across 4 datasets x 4 models x 6 UQ methods. Our analysis reveals that length biases in the errors of these correctness functions distort UQ assessments by interacting with length biases in UQ methods. We identify LLM-as-a-judge approaches as among the least length-biased choices and hence a potential solution to mitigate these biases.
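
The evaluation setup described in the abstract, scoring a UQ method by the AUROC between its uncertainty values and a correctness function, can be sketched as follows. This is a minimal illustration and not the authors' code: the toy numbers, the `THRESHOLD` used to binarize the correctness score, and the helper name `auroc` are all assumptions made for the example.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive scores higher than a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical per-response uncertainty, e.g. a negative sequence log-probability.
uncertainty = [0.2, 0.4, 1.5, 2.3, 0.9, 3.1]
# Hypothetical correctness scores in [0, 1], e.g. ROUGE-L against a reference.
correctness = [0.9, 0.8, 0.2, 0.1, 0.6, 0.05]
THRESHOLD = 0.5  # assumed cut-off for calling a response "correct"

# Label 1 marks an incorrect response; a useful UQ method should assign
# higher uncertainty to those responses.
is_incorrect = [int(c < THRESHOLD) for c in correctness]
uq_auroc = auroc(uncertainty, is_incorrect)
print(f"AUROC of uncertainty vs. incorrectness: {uq_auroc:.2f}")
```

In this framing, if both the uncertainty score and the errors of the correctness function depend on response length, the AUROC can rise for reasons unrelated to actual correctness, which is the spurious interaction the abstract describes.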

arXiv information

Authors	Andrea Santilli, Adam Golinski, Michael Kirchhof, Federico Danieli, Arno Blaas, Miao Xiong, Luca Zappella, Sinead Williamson
Published	2025-04-18 13:13:42+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
