A Theoretical Framework for Explaining Reinforcement Learning with Shapley Values

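Abstract

Reinforcement learning agents can achieve superhuman performance, but their decisions are often difficult to interpret.
This lack of transparency limits deployment, especially in safety-critical settings where human trust and accountability are essential.
In this work, we develop a theoretical framework for explaining reinforcement learning through the influence of state features, which represent what the agent observes in its environment.
We identify three core elements of the agent-environment interaction that benefit from explanation: behaviour (what the agent does), performance (what the agent achieves), and value estimation (what the agent expects to achieve).
We treat state features as players cooperating to produce each element and apply Shapley values, a principled method from cooperative game theory, to identify the influence of each feature.
This approach yields a family of mathematically grounded explanations with clear semantics and theoretical guarantees.
We use illustrative examples to show how these explanations align with human intuition and reveal novel insights.
Our framework unifies and extends prior work, makes explicit the assumptions behind existing approaches, and offers a principled foundation for more interpretable and trustworthy reinforcement learning.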

Abstract (original)

Reinforcement learning agents can achieve superhuman performance, but their decisions are often difficult to interpret. This lack of transparency limits deployment, especially in safety-critical settings where human trust and accountability are essential. In this work, we develop a theoretical framework for explaining reinforcement learning through the influence of state features, which represent what the agent observes in its environment. We identify three core elements of the agent-environment interaction that benefit from explanation: behaviour (what the agent does), performance (what the agent achieves), and value estimation (what the agent expects to achieve). We treat state features as players cooperating to produce each element and apply Shapley values, a principled method from cooperative game theory, to identify the influence of each feature. This approach yields a family of mathematically grounded explanations with clear semantics and theoretical guarantees. We use illustrative examples to show how these explanations align with human intuition and reveal novel insights. Our framework unifies and extends prior work, making explicit the assumptions behind existing approaches, and offers a principled foundation for more interpretable and trustworthy reinforcement learning.
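
To make the idea of "features as players" concrete, here is a minimal sketch of an exact Shapley value computation in Python. The function `shapley_values`, the feature names `x`, `y`, `z`, and the characteristic function `toy_value` are all hypothetical and chosen only for illustration; they are not the characteristic functions defined in the paper, which the abstract does not specify.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a characteristic function `value`.

    `players` is a list of hashable player names; `value` maps a
    frozenset of players to a real number. The cost is exponential in
    the number of players, so this only suits small examples.
    """
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):
            # Probability that i joins after exactly `size` specific players
            # in a uniformly random ordering.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of i to this coalition.
                total += weight * (value(coalition | {i}) - value(coalition))
        phi[i] = total
    return phi


def toy_value(coalition):
    """Hypothetical characteristic function (an assumption, not the paper's).

    Knowing feature "x" is worth 1, knowing "x" and "y" together adds 2,
    and feature "z" is irrelevant.
    """
    v = 0.0
    if "x" in coalition:
        v += 1.0
    if "x" in coalition and "y" in coalition:
        v += 2.0
    return v


features = ["x", "y", "z"]
phi = shapley_values(features, toy_value)
```

In this toy game the contributions sum to the value of the full coalition (the efficiency property), and the irrelevant feature `z` receives zero. In the setting the abstract describes, the characteristic function would instead measure the agent's behaviour, performance, or value estimate when only a subset of state features is known.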

arXiv information

Authors	Daniel Beechey, Thomas M. S. Smith, Özgür Şimşek
Published	2025-05-12 17:48:28+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
