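Current model-based methods apply some notion of conservatism to the Bellman update, often implemented using uncertainty estimates derived from model ensembles.
This paper proposes Constrained Latent Action Policies (C-LAP), which learn a generative model of the joint distribution of observations and actions.
We empirically evaluate C-LAP on the D4RL and V-D4RL benchmarks and show that it is competitive with state-of-the-art methods, performing especially well on datasets with visual observations.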
In offline reinforcement learning, a policy is learned from a static dataset, without costly feedback from the environment. In contrast to the online setting, relying only on static datasets poses additional challenges, such as policies generating out-of-distribution samples. Model-based offline reinforcement learning methods try to overcome these challenges by learning a model of the underlying dynamics of the environment and using it to guide policy search. This is beneficial, but with limited datasets, errors in the model and value overestimation for out-of-distribution states can worsen performance. Current model-based methods apply some notion of conservatism to the Bellman update, often implemented using uncertainty estimates derived from model ensembles. In this paper, we propose Constrained Latent Action Policies (C-LAP), which learn a generative model of the joint distribution of observations and actions. We cast policy learning as a constrained objective that always stays within the support of the latent action distribution, and use the generative capabilities of the model to impose an implicit constraint on the generated actions. This eliminates the need for additional uncertainty penalties on the Bellman update and significantly decreases the number of gradient steps required to learn a policy. We empirically evaluate C-LAP on the D4RL and V-D4RL benchmarks and show that C-LAP is competitive with state-of-the-art methods, performing especially well on datasets with visual observations.
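The idea of keeping a policy within the support of a latent action distribution can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a diagonal Gaussian latent action prior, a tanh-based bounding scheme, and a linear decoder, and the names `constrain_latent_action`, `decode_action`, and `epsilon` are hypothetical.

```python
import numpy as np

def constrain_latent_action(raw_output, prior_mean, prior_std, epsilon=2.0):
    """Map an unbounded policy output into the region
    [prior_mean - epsilon * prior_std, prior_mean + epsilon * prior_std].

    Assumption: the latent action prior is a diagonal Gaussian and its
    "support" is approximated by epsilon standard deviations around the mean.
    """
    # tanh bounds each coordinate to [-1, 1], so the offset from the prior
    # mean never exceeds epsilon standard deviations.
    return prior_mean + epsilon * prior_std * np.tanh(raw_output)

def decode_action(latent_action, weight, bias):
    """Hypothetical action decoder: a linear layer with tanh squashing."""
    return np.tanh(latent_action @ weight + bias)

rng = np.random.default_rng(0)
prior_mean = np.array([0.5, -1.0])
prior_std = np.array([0.1, 0.3])

# Deliberately large raw outputs, to show the bound holds regardless.
raw_output = rng.normal(scale=10.0, size=(5, 2))
latent = constrain_latent_action(raw_output, prior_mean, prior_std)

weight = rng.normal(size=(2, 3))
bias = np.zeros(3)
actions = decode_action(latent, weight, bias)
```

Because the bound is applied in latent space and the decoder is part of the generative model, the decoded actions are only implicitly constrained, which is the property the abstract describes; the specific bounding function used here is an illustrative choice.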
Authors | Marvin Alles, Philip Becker-Ehmck, Patrick van der Smagt, Maximilian Karl |
Published | 2025-01-15 13:24:49+00:00 |
arXiv page | arxiv_id(pdf) |