Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations

要約

鎖の説明は、大規模な言語モデル（LLM）の決定プロセスを検査し、モデル出力の信頼性を評価するために広く使用されており、LLMと人間の効果的なコラボレーションに重要になっています。
優先最適化（アライメントフェーズの重要なステップ）が、これらの説明の忠実さを誤って減らすことができることを実証します。
これは、アラインメントをガイドする報酬モデル（RM）が、応答の予想される品質と説明の適切性の両方を最適化する（例えば、バイアスの最小化や安全基準に準拠するなど）、潜在的な競合を生み出すために発生します。
RMには、モデルの内部決定プロセスと生成された説明との一貫性を評価するメカニズムがありません。
その結果、LLMは、その推論を正確に反映するのではなく、報酬を最大化するために調整された説明を提供しながら、高度に得点する最終的な応答を生成することにより、「報酬ハッキング」に従事する可能性があります。
この問題に対処するために、RMの入力を予測の因果的な帰属で強化することを提案し、RMが生成された自己実現とモデルの決定プロセスとの間の矛盾を検出できるようにします。
制御された設定では、このアプローチがLLMの傾向を減らして誤解を招く説明を生成することを示します。

要約(オリジナル)

Chain-of-thought explanations are widely used to inspect the decision process of large language models (LLMs) and to evaluate the trustworthiness of model outputs, making them important for effective collaboration between LLMs and humans. We demonstrate that preference optimization – a key step in the alignment phase – can inadvertently reduce the faithfulness of these explanations. This occurs because the reward model (RM), which guides alignment, is tasked with optimizing both the expected quality of the response and the appropriateness of the explanations (e.g., minimizing bias or adhering to safety standards), creating potential conflicts. The RM lacks a mechanism to assess the consistency between the model’s internal decision process and the generated explanation. Consequently, the LLM may engage in ‘reward hacking’ by producing a final response that scores highly while giving an explanation tailored to maximize reward rather than accurately reflecting its reasoning. To address this issue, we propose enriching the RM’s input with a causal attribution of the prediction, allowing the RM to detect discrepancies between the generated self-explanation and the model’s decision process. In controlled settings, we show that this approach reduces the tendency of the LLM to generate misleading explanations.

arxiv情報

著者	Pedro Ferreira,Wilker Aziz,Ivan Titov
発行日	2025-04-07 17:49:23+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

最近の投稿

最近のコメント

アーカイブ

カテゴリー