Double Duality: Variational Primal-Dual Policy Optimization for Constrained Reinforcement Learning

Summary

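We study the constrained convex Markov decision process (MDP), where the goal is to minimize a convex functional of the visitation measure subject to a convex constraint.
Designing algorithms for constrained convex MDPs raises several challenges: (1) handling large state spaces, (2) managing the exploration/exploitation tradeoff, and (3) solving a constrained optimization problem in which both the objective and the constraint are nonlinear functions of the visitation measure.
This work presents Variational Primal-Dual Policy Optimization (VPDPO), a model-based algorithm that applies Lagrangian and Fenchel duality to reformulate the original constrained problem as an unconstrained primal-dual optimization.
The primal variables are updated by model-based value iteration following the principle of Optimism in the Face of Uncertainty (OFU), while the dual variables are updated by gradient ascent.
In addition, embedding the visitation measure into a finite-dimensional space makes it possible to handle large state spaces through function approximation.
Two notable examples are (1) kernelized nonlinear regulators and (2) low-rank MDPs.
Given an optimistic planning oracle, the algorithm is proven to achieve sublinear regret and constraint violation in both cases and to attain the globally optimal policy of the original constrained problem.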
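
The reformulation through Lagrangian and Fenchel duality can be sketched as follows. The notation is an assumption made for illustration and is not taken from the paper: \mu^\pi denotes the visitation measure of policy \pi, f and g are the convex objective and constraint functionals (assumed closed and proper), f^* and g^* are their Fenchel conjugates, \lambda is the Lagrange multiplier, and \alpha, \beta are the Fenchel dual variables. The paper's exact formulation may differ.

```latex
\begin{align*}
% Original constrained convex MDP
&\min_{\pi}\; f(\mu^\pi) \quad \text{s.t.} \quad g(\mu^\pi) \le 0 \\
% Step 1: Lagrangian duality removes the constraint
&\min_{\pi}\,\max_{\lambda \ge 0}\; f(\mu^\pi) + \lambda\, g(\mu^\pi) \\
% Step 2: Fenchel duality, f(\mu) = \sup_\alpha \langle \alpha, \mu \rangle - f^*(\alpha),
% and likewise for g (valid under the multiplier because \lambda \ge 0)
&\min_{\pi}\,\max_{\lambda \ge 0,\; \alpha,\; \beta}\;
  \langle \alpha + \lambda \beta,\ \mu^\pi \rangle - f^*(\alpha) - \lambda\, g^*(\beta)
\end{align*}
```

For fixed dual variables (\lambda, \alpha, \beta), the objective is linear in \mu^\pi, so the inner problem over \pi becomes a standard MDP with per-state-action cost \alpha + \lambda\beta. Under this reading, that linearity is what allows the primal update to be carried out by value iteration.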

Summary (original)

We study the Constrained Convex Markov Decision Process (MDP), where the goal is to minimize a convex functional of the visitation measure, subject to a convex constraint. Designing algorithms for a constrained convex MDP faces several challenges, including (1) handling the large state space, (2) managing the exploration/exploitation tradeoff, and (3) solving the constrained optimization where the objective and the constraint are both nonlinear functions of the visitation measure. In this work, we present a model-based algorithm, Variational Primal-Dual Policy Optimization (VPDPO), in which Lagrangian and Fenchel duality are implemented to reformulate the original constrained problem into an unconstrained primal-dual optimization. Moreover, the primal variables are updated by model-based value iteration following the principle of Optimism in the Face of Uncertainty (OFU), while the dual variables are updated by gradient ascent. Moreover, by embedding the visitation measure into a finite-dimensional space, we can handle large state spaces by incorporating function approximation. Two notable examples are (1) Kernelized Nonlinear Regulators and (2) Low-rank MDPs. We prove that with an optimistic planning oracle, our algorithm achieves sublinear regret and constraint violation in both cases and can attain the globally optimal policy of the original constrained problem.
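
A minimal sketch of the primal-dual loop on a toy problem may help make the update pattern concrete. This is not the paper's VPDPO: the model-based, optimistic value iteration is swapped for an exponentiated-gradient step on a known single-state problem, where the visitation measure is simply a probability vector over three actions. The objective `f(mu) = 0.5 * ||mu - target||^2`, the constraint `mu[0] <= budget`, the step sizes, and all names are hypothetical choices made for illustration.

```python
import numpy as np


def primal_dual_toy(steps=5000, eta_primal=0.05, eta_dual=0.05):
    """Toy primal-dual iteration on the probability simplex (not VPDPO itself)."""
    # Hypothetical convex objective: f(mu) = 0.5 * ||mu - target||^2
    target = np.array([1.0, 0.0, 0.0])
    # Hypothetical convex constraint: g(mu) = mu[0] - budget <= 0
    budget = 0.5

    mu = np.full(3, 1.0 / 3.0)  # visitation measure of a single-state MDP
    lam = 0.0                   # Lagrange multiplier, kept nonnegative

    for _ in range(steps):
        # Fenchel dual variable in closed form: the maximizer of
        # <alpha, mu> - f*(alpha) is alpha = grad f(mu) = mu - target.
        alpha = mu - target

        # With (alpha, lam) fixed, the Lagrangian is linear in mu; this
        # vector is its per-action cost (g is linear, so its gradient is e_0).
        cost = alpha.copy()
        cost[0] += lam

        # Primal step: exponentiated gradient, a stand-in for the
        # value-iteration update described in the abstract.
        mu = mu * np.exp(-eta_primal * cost)
        mu /= mu.sum()

        # Dual step: projected gradient ascent on the multiplier.
        lam = max(0.0, lam + eta_dual * (mu[0] - budget))

    return mu, lam


if __name__ == "__main__":
    mu, lam = primal_dual_toy()
    print("visitation measure:", mu)
    print("multiplier:", lam)
```

The KKT point of this toy problem is mu = (0.5, 0.25, 0.25) with multiplier 0.75, which the iterates are expected to approach. Regret and constraint-violation guarantees of the kind stated in the abstract rely on the paper's optimistic planning oracle and are not reproduced by this sketch.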

arXiv information

Authors	Zihao Li, Boyi Liu, Zhuoran Yang, Zhaoran Wang, Mengdi Wang
Published	2024-02-16 16:35:18+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
