Automatic Reward Shaping from Confounded Offline Data

Summary

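A key task in artificial intelligence is learning effective policies for controlling agents in unknown environments so as to optimize a performance measure.
Off-policy learning methods such as Q-learning allow a learner to make optimal decisions based on past experience.
This paper studies off-policy learning from biased data in complex, high-dimensional domains where unobserved confounding cannot be ruled out a priori.
Building on the well-known Deep Q-Network (DQN), we propose a new deep reinforcement learning algorithm that is robust to confounding bias in the observed data.
Specifically, our algorithm attempts to find a safe policy for the worst-case environment compatible with the observations.
We apply the method to twelve confounded Atari games and find that it consistently dominates the standard DQN in all games where the observed inputs to the behavioral and target policies mismatch and unobserved confounders exist.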
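
The idea of choosing a safe policy for the worst-case environment compatible with the observations can be illustrated with a small tabular sketch. This is not the paper's DQN-based algorithm: it is a toy pessimistic value iteration over a hand-written, finite set of candidate models, and every name in it (`worst_case_q`, `greedy_policy`, `model_a`, `model_b`) is hypothetical. It assumes deterministic models and takes the minimum independently for each state-action pair.

```python
from typing import Dict, List, Tuple

# A candidate model maps (state, action) -> (reward, next_state).
# Deterministic toy models keep the sketch short; all names are hypothetical.
Model = Dict[Tuple[int, int], Tuple[float, int]]


def worst_case_q(models: List[Model], n_states: int, n_actions: int,
                 gamma: float = 0.9, sweeps: int = 200) -> List[List[float]]:
    """Pessimistic Q-values: each backup uses the least favourable candidate model."""
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        new_q = [[0.0] * n_actions for _ in range(n_states)]
        for s in range(n_states):
            for a in range(n_actions):
                backups = []
                for m in models:
                    r, s2 = m[(s, a)]
                    backups.append(r + gamma * max(q[s2]))
                # Worst case over all models compatible with the data.
                new_q[s][a] = min(backups)
        q = new_q
    return q


def greedy_policy(q: List[List[float]]) -> List[int]:
    """Pick the action with the highest Q-value in each state."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


# Two models that agree on action 0 but disagree on the reward of action 1
# in state 0, as could happen when confounded data cannot identify it.
# State 1 is absorbing with zero reward.
model_a: Model = {(0, 0): (1.0, 1), (0, 1): (2.0, 1),
                  (1, 0): (0.0, 1), (1, 1): (0.0, 1)}
model_b: Model = {(0, 0): (1.0, 1), (0, 1): (0.0, 1),
                  (1, 0): (0.0, 1), (1, 1): (0.0, 1)}

robust_q = worst_case_q([model_a, model_b], n_states=2, n_actions=2)
optimistic_q = worst_case_q([model_a], n_states=2, n_actions=2)

print(greedy_policy(robust_q)[0])      # action chosen in state 0 under the worst case
print(greedy_policy(optimistic_q)[0])  # action chosen in state 0 if model_a is trusted
```

In this toy setting the pessimistic learner prefers the action whose reward is the same in every candidate model, while a learner that trusts a single model prefers the action whose payoff is not identified by the data.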

Summary (original)

A key task in Artificial Intelligence is learning effective policies for controlling agents in unknown environments to optimize performance measures. Off-policy learning methods, like Q-learning, allow learners to make optimal decisions based on past experiences. This paper studies off-policy learning from biased data in complex and high-dimensional domains where unobserved confounding cannot be ruled out a priori. Building on the well-celebrated Deep Q-Network (DQN), we propose a novel deep reinforcement learning algorithm robust to confounding biases in observed data. Specifically, our algorithm attempts to find a safe policy for the worst-case environment compatible with the observations. We apply our method to twelve confounded Atari games, and find that it consistently dominates the standard DQN in all games where the observed input to the behavioral and target policies mismatch and unobserved confounders exist.

arXiv information

Authors	Mingxuan Li, Junzhe Zhang, Elias Bareinboim
Published	2025-05-16 17:40:01+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
