Periodic agent-state based Q-learning for POMDPs

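Summary

The standard approach to partially observable Markov decision processes (POMDPs) is to convert them into a fully observed belief-state MDP.
However, the belief state depends on the system model, so it is not viable in reinforcement learning (RL) settings.
A widely used alternative is the agent state: a model-free, recursively updatable function of the observation history.
Examples include frame stacking and recurrent neural networks.
Because the agent state is model-free, it is used to adapt standard RL algorithms to POMDPs.
However, standard RL algorithms such as Q-learning learn a stationary policy.
The main thesis, illustrated through examples, is that because the agent state does not satisfy the Markov property, non-stationary agent-state based policies can outperform stationary ones.
To exploit this, the authors propose PASQL (periodic agent-state based Q-learning), a variant of agent-state based Q-learning that learns periodic policies.
By combining ideas from periodic Markov chains and stochastic approximation, they rigorously establish that PASQL converges to a cyclic limit and characterize the approximation error of the converged periodic policy.
Finally, they present a numerical experiment that highlights the salient features of PASQL and demonstrates the benefit of learning periodic policies over stationary policies.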
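
As a rough illustration of what "model-free, recursively updatable" means, the following minimal sketch implements frame stacking as an agent state. The function names (`make_agent_state`, `update_agent_state`, `as_key`) and the window length `k` are hypothetical choices made for this example and do not come from the paper.

```python
from collections import deque
from typing import Deque, Hashable, Tuple


def make_agent_state(k: int) -> Deque[Hashable]:
    """Empty agent state that remembers at most the k most recent observations."""
    return deque(maxlen=k)


def update_agent_state(state: Deque[Hashable], observation: Hashable) -> Deque[Hashable]:
    """Recursive update: the new state depends only on the old state and the
    new observation, never on the (unknown) system model."""
    new_state = deque(state, maxlen=state.maxlen)  # copy, keep the window size
    new_state.append(observation)                  # the oldest entry drops out when full
    return new_state


def as_key(state: Deque[Hashable]) -> Tuple[Hashable, ...]:
    """Hashable view of the agent state, usable as a Q-table index."""
    return tuple(state)


# Usage: feed observations one at a time.
z = make_agent_state(2)
for y in ["a", "b", "c"]:
    z = update_agent_state(z, y)
print(as_key(z))
```

With a window of two, the state after observing `a`, `b`, `c` keeps only the last two observations. A finite window of this kind generally does not summarize the full history, which is why such an agent state need not satisfy the Markov property.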

Summary (original)

The standard approach for Partially Observable Markov Decision Processes (POMDPs) is to convert them to a fully observed belief-state MDP. However, the belief state depends on the system model and is therefore not viable in reinforcement learning (RL) settings. A widely used alternative is to use an agent state, which is a model-free, recursively updateable function of the observation history. Examples include frame stacking and recurrent neural networks. Since the agent state is model-free, it is used to adapt standard RL algorithms to POMDPs. However, standard RL algorithms like Q-learning learn a stationary policy. Our main thesis that we illustrate via examples is that because the agent state does not satisfy the Markov property, non-stationary agent-state based policies can outperform stationary ones. To leverage this feature, we propose PASQL (periodic agent-state based Q-learning), which is a variant of agent-state-based Q-learning that learns periodic policies. By combining ideas from periodic Markov chains and stochastic approximation, we rigorously establish that PASQL converges to a cyclic limit and characterize the approximation error of the converged periodic policy. Finally, we present a numerical experiment to highlight the salient features of PASQL and demonstrate the benefit of learning periodic policies over stationary policies.
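
The abstract does not spell out the update rule, so the following is only a sketch of one way a periodic variant of tabular Q-learning could look. It assumes one Q-table per phase `t mod L`, with each table bootstrapping from the table of the next phase. The names `init_q_tables`, `periodic_q_update`, `periodic_greedy_policy`, and `act`, as well as the step size and discount values, are assumptions made for illustration and should not be read as the authors' implementation.

```python
import numpy as np


def init_q_tables(period: int, n_states: int, n_actions: int) -> np.ndarray:
    """One Q-table per phase: shape (period, n_states, n_actions)."""
    return np.zeros((period, n_states, n_actions))


def periodic_q_update(q, t, z, a, reward, z_next, alpha=0.1, gamma=0.9):
    """Update the table for phase t mod L, bootstrapping from phase (t+1) mod L.

    Assumption: this mirrors ordinary Q-learning except that the table index
    cycles with time; z and z_next are integer agent-state indices.
    """
    period = q.shape[0]
    phase = t % period
    next_phase = (t + 1) % period
    target = reward + gamma * np.max(q[next_phase, z_next])
    q[phase, z, a] += alpha * (target - q[phase, z, a])
    return q


def periodic_greedy_policy(q: np.ndarray) -> np.ndarray:
    """Greedy action for every (phase, agent state) pair: shape (period, n_states)."""
    return np.argmax(q, axis=2)


def act(policy: np.ndarray, t: int, z: int) -> int:
    """The policy is periodic: the action depends on t only through t mod L."""
    return int(policy[t % policy.shape[0], z])


# Usage: two phases, two agent states, two actions.
q = init_q_tables(period=2, n_states=2, n_actions=2)
q = periodic_q_update(q, t=0, z=0, a=1, reward=1.0, z_next=1, alpha=0.5)
q = periodic_q_update(q, t=1, z=1, a=0, reward=0.0, z_next=0, alpha=0.5)
policy = periodic_greedy_policy(q)
print(act(policy, t=2, z=0))
```

Setting the period to 1 collapses this sketch to ordinary stationary agent-state Q-learning, which makes the stationary case a special case of the periodic one.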

arXiv information

Authors	Amit Sinha, Mathieu Geist, Aditya Mahajan
Published	2024-07-08 16:58:57+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
