Explaining Learned Reward Functions with Counterfactual Trajectories

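Abstract

Learning rewards from human behaviour or feedback is a promising approach to aligning AI systems with human values, but it does not consistently extract the correct reward function.
Interpretability tools could enable users to understand and evaluate possible flaws in learned reward functions.
We propose Counterfactual Trajectory Explanations (CTEs), which interpret reward functions in reinforcement learning by contrasting an original and a counterfactual partial trajectory, along with the reward each receives.
We derive six quality criteria for CTEs and propose a novel Monte-Carlo-based algorithm that generates CTEs optimising these criteria.
Finally, we measure how informative the generated explanations are to a proxy-human model by training it on CTEs.
CTEs are demonstrably informative for the proxy-human model, increasing the similarity between its predictions and the reward function on unseen trajectories.
Further, the proxy-human model learns to accurately judge differences in reward between trajectories and generalises to out-of-distribution examples.
Although CTEs do not lead to a perfect understanding of the reward, our method, and more generally the adaptation of XAI methods, is presented as a fruitful approach for interpreting learned reward functions.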

Abstract (original)

Learning rewards from human behaviour or feedback is a promising approach to aligning AI systems with human values but fails to consistently extract correct reward functions. Interpretability tools could enable users to understand and evaluate possible flaws in learned reward functions. We propose Counterfactual Trajectory Explanations (CTEs) to interpret reward functions in reinforcement learning by contrasting an original with a counterfactual partial trajectory and the rewards they each receive. We derive six quality criteria for CTEs and propose a novel Monte-Carlo-based algorithm for generating CTEs that optimises these quality criteria. Finally, we measure how informative the generated explanations are to a proxy-human model by training it on CTEs. CTEs are demonstrably informative for the proxy-human model, increasing the similarity between its predictions and the reward function on unseen trajectories. Further, it learns to accurately judge differences in rewards between trajectories and generalises to out-of-distribution examples. Although CTEs do not lead to a perfect understanding of the reward, our method, and more generally the adaptation of XAI methods, are presented as a fruitful approach for interpreting learned reward functions.
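
To make the idea of contrasting an original with a counterfactual partial trajectory concrete, here is a minimal sketch. It is not the authors' algorithm: the abstract does not list the six quality criteria or the details of the Monte-Carlo-based search, so this example swaps in plain random sampling and scores candidates by a single stand-in criterion (the absolute reward gap). The grid world, the `CTE` dataclass, and the names `random_rollout` and `generate_cte` are all hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = Tuple[int, int]      # assumed toy state: a cell in a small grid
Trajectory = List[State]


@dataclass
class CTE:
    """One explanation: two partial trajectories and the reward of each."""
    original: Trajectory
    counterfactual: Trajectory
    original_reward: float
    counterfactual_reward: float


def trajectory_reward(reward_fn: Callable[[State], float], traj: Trajectory) -> float:
    """Sum the (learned) per-state reward over a partial trajectory."""
    return sum(reward_fn(s) for s in traj)


def random_rollout(start: State, steps: int, rng: random.Random, size: int) -> Trajectory:
    """Sample a random walk of `steps` moves that stays inside the grid."""
    moves = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    x, y = start
    traj = [start]
    for _ in range(steps):
        dx, dy = rng.choice(moves)
        x = min(max(x + dx, 0), size - 1)
        y = min(max(y + dy, 0), size - 1)
        traj.append((x, y))
    return traj


def generate_cte(original: Trajectory, reward_fn: Callable[[State], float],
                 start_index: int, n_samples: int = 200,
                 size: int = 5, seed: int = 0) -> CTE:
    """Pick the sampled counterfactual whose reward differs most from the original.

    Assumption: the reward gap stands in for the paper's six quality criteria,
    which the abstract does not spell out.
    """
    rng = random.Random(seed)
    partial = original[start_index:]          # both trajectories share this start state
    r_orig = trajectory_reward(reward_fn, partial)
    best, best_gap = None, -1.0
    for _ in range(n_samples):
        cf = random_rollout(partial[0], len(partial) - 1, rng, size)
        r_cf = trajectory_reward(reward_fn, cf)
        gap = abs(r_cf - r_orig)
        if gap > best_gap:
            best, best_gap = CTE(partial, cf, r_orig, r_cf), gap
    return best
```

A user would read the returned pair side by side: both trajectories start in the same state, so any difference in reward is attributable to the states visited after the deviation.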

arXiv information

Authors	Jan Wehner, Frans Oliehoek, Luciano Cavalcante Siebert
Published	2024-02-07 13:54:38+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
