Interpreting Language Reward Models via Contrastive Explanations

Abstract

Reward models (RMs) are a crucial component in the alignment of large language models’ (LLMs) outputs with human values. RMs approximate human preferences over possible LLM responses to the same prompt by predicting and comparing reward scores. However, as they are typically modified versions of LLMs with scalar output heads, RMs are large black boxes whose predictions are not explainable. More transparent RMs would enable improved trust in the alignment of LLMs. In this work, we propose to use contrastive explanations to explain any binary response comparison made by an RM. Specifically, we generate a diverse set of new comparisons similar to the original one to characterise the RM’s local behaviour. The perturbed responses forming the new comparisons are generated to explicitly modify manually specified high-level evaluation attributes, on which analyses of RM behaviour are grounded. In quantitative experiments, we validate the effectiveness of our method for finding high-quality contrastive explanations. We then showcase the qualitative usefulness of our method for investigating global sensitivity of RMs to each evaluation attribute, and demonstrate how representative examples can be automatically extracted to explain and compare behaviours of different RMs. We see our method as a flexible framework for RM explanation, providing a basis for more interpretable and trustworthy LLM alignment.
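
The comparison-and-perturbation idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_reward` is a hypothetical stand-in for a real RM (it simply rewards longer and more polite responses), the perturbed responses are written by hand rather than generated, and the attribute names are invented. It is not the authors' implementation; it only shows how perturbing one high-level attribute at a time can reveal which change flips a binary preference.

```python
from typing import Callable, List, Tuple

RewardFn = Callable[[str, str], float]


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical stand-in for an RM: favours longer, polite responses."""
    score = 0.1 * len(response.split())
    if "thanks" in response.lower() or "please" in response.lower():
        score += 0.5
    return score


def prefers_first(reward: RewardFn, prompt: str, a: str, b: str) -> bool:
    """Binary comparison: True if the RM scores response a above response b."""
    return reward(prompt, a) > reward(prompt, b)


def flipping_perturbations(
    reward: RewardFn,
    prompt: str,
    chosen: str,
    rejected: str,
    perturbations: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return the (attribute, perturbed_response) pairs that flip the preference."""
    original = prefers_first(reward, prompt, chosen, rejected)
    return [
        (attribute, perturbed)
        for attribute, perturbed in perturbations
        if prefers_first(reward, prompt, perturbed, rejected) != original
    ]


prompt = "How do I reset my password?"
chosen = "Thanks for asking. Open settings, choose security, then select reset password."
rejected = "Look in the settings menu somewhere, it should be there."

# Hand-written perturbations of the chosen response, one attribute each (assumed).
perturbations = [
    ("politeness", "Open settings, choose security, then select reset password."),
    ("detail", "Thanks for asking. Open settings and reset it."),
]

flips = flipping_perturbations(toy_reward, prompt, chosen, rejected, perturbations)
for attribute, perturbed in flips:
    print(f"Changing '{attribute}' flips the preference: {perturbed!r}")
```

In this toy setting only the politeness change reverses the comparison, which suggests the stand-in scorer is locally more sensitive to politeness than to detail for this pair. A real analysis would replace `toy_reward` with an actual RM's scalar output and the hand-written list with generated perturbations.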

arXiv information

Authors	Junqi Jiang, Tom Bewley, Saumitra Mishra, Freddy Lecue, Manuela Veloso
Published	2025-02-26 16:46:25+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
