Interpreting Language Reward Models via Contrastive Explanations

Summary

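Reward models (RMs) are a key component in aligning the outputs of large language models (LLMs) with human values.
An RM approximates human preferences over candidate LLM responses to the same prompt by predicting and comparing reward scores.
However, because RMs are typically modified versions of LLMs with a scalar output head, they are large black boxes whose predictions cannot be explained.
More transparent RMs would improve trust in LLM alignment.
In this work, we propose using contrastive explanations to explain any binary response comparison made by an RM.
Specifically, to characterise the RM's local behaviour, we generate a diverse set of new comparisons similar to the original one.
The perturbed responses that form the new comparisons are generated to explicitly modify manually specified high-level evaluation attributes, and the analysis of RM behaviour is grounded on these attributes.
In quantitative experiments, we validate the effectiveness of the method for finding high-quality contrastive explanations.
We then showcase the method's qualitative usefulness for investigating the global sensitivity of RMs to each evaluation attribute, and show how representative examples can be extracted automatically to explain and compare the behaviours of different RMs.
We see the method as a flexible framework for RM explanation that provides a basis for more interpretable and trustworthy LLM alignment.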

Abstract (original)

Reward models (RMs) are a crucial component in the alignment of large language models’ (LLMs) outputs with human values. RMs approximate human preferences over possible LLM responses to the same prompt by predicting and comparing reward scores. However, as they are typically modified versions of LLMs with scalar output heads, RMs are large black boxes whose predictions are not explainable. More transparent RMs would enable improved trust in the alignment of LLMs. In this work, we propose to use contrastive explanations to explain any binary response comparison made by an RM. Specifically, we generate a diverse set of new comparisons similar to the original one to characterise the RM’s local behaviour. The perturbed responses forming the new comparisons are generated to explicitly modify manually specified high-level evaluation attributes, on which analyses of RM behaviour are grounded. In quantitative experiments, we validate the effectiveness of our method for finding high-quality contrastive explanations. We then showcase the qualitative usefulness of our method for investigating global sensitivity of RMs to each evaluation attribute, and demonstrate how representative examples can be automatically extracted to explain and compare behaviours of different RMs. We see our method as a flexible framework for RM explanation, providing a basis for more interpretable and trustworthy LLM alignment.
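
The core idea of the abstract, checking which attribute-level perturbations of a response change the RM's binary preference, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `RewardFn` signature, the hypothetical helper `flipped_attributes`, the attribute names, and the `toy_reward` stand-in (a real RM would be an LLM with a scalar output head). The sketch also simplifies by perturbing only the preferred response and by taking the perturbed texts as given rather than generating them.

```python
from typing import Callable, Dict, List

# Assumed interface: (prompt, response) -> scalar reward score.
RewardFn = Callable[[str, str], float]


def flipped_attributes(
    reward: RewardFn,
    prompt: str,
    chosen: str,
    rejected: str,
    perturbed: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """For each attribute, keep the perturbed versions of `chosen`
    that the RM now scores below `rejected` (a flipped comparison)."""
    baseline = reward(prompt, rejected)
    flips: Dict[str, List[str]] = {}
    for attribute, candidates in perturbed.items():
        hits = [c for c in candidates if reward(prompt, c) < baseline]
        if hits:
            flips[attribute] = hits
    return flips


def toy_reward(prompt: str, response: str) -> float:
    # Hypothetical stand-in for a real RM: favours longer responses
    # and penalises each occurrence of the hedging word "maybe".
    return len(response.split()) - 5.0 * response.lower().count("maybe")


prompt = "What is the capital of France?"
chosen = "The capital of France is Paris."
rejected = "Paris."
# Perturbations grouped by a manually specified evaluation attribute.
perturbed = {
    "conciseness": ["Paris is the capital."],
    "confidence": ["Maybe it is Paris, maybe."],
}

result = flipped_attributes(toy_reward, prompt, chosen, rejected, perturbed)
print(result)
```

With this toy scorer, only the perturbation along the hypothetical "confidence" attribute flips the comparison, which is the kind of local, attribute-level signal the abstract describes. Perturbations that do not flip the preference are simply dropped here, although they could equally be reported as evidence of insensitivity.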

arXiv information

Authors	Junqi Jiang, Tom Bewley, Saumitra Mishra, Freddy Lecue, Manuela Veloso
Published	2024-11-25 15:37:27+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
