Q-value Regularized Decision ConvFormer for Offline Reinforcement Learning

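Summary

As a data-driven paradigm, offline reinforcement learning (Offline RL) has been formulated as sequence modeling, where the Decision Transformer (DT) has demonstrated strong capabilities.
Unlike earlier reinforcement learning methods that fit value functions or compute policy gradients, DT conditions an autoregressive model on the expected return, past states, and past actions, and uses a causally masked Transformer to output the optimal action.
However, the returns sampled within a single trajectory are inconsistent with the optimal returns across multiple trajectories, so it is difficult to set an expected return that yields the optimal action and stitches suboptimal trajectories together.
Compared with DT, Decision ConvFormer (DC) is easier to understand in the context of modeling RL trajectories within a Markov Decision Process.
We propose the Q-value Regularized Decision ConvFormer (QDC), which combines DC's understanding of RL trajectories with a term, trained using dynamic programming methods, that maximizes action values.
This ensures that the expected returns of the sampled actions are consistent with the optimal returns.
QDC achieves excellent performance on the D4RL benchmark, outperforming or approaching the optimal level in all tested environments.
It is particularly competitive in trajectory stitching capability.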
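
The mismatch between returns sampled within one trajectory and the best return reachable across trajectories can be illustrated with a toy example. The sketch below is not taken from the paper: the two trajectories, their state names, and their rewards are made up, and it assumes a deterministic environment in which either continuation is available from the shared state "B".

```python
# Toy illustration only: trajectories, states, and rewards are hypothetical.

def returns_to_go(rewards):
    """Return-to-go at each step: the sum of rewards from that step to the end."""
    rtg = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg

# Each trajectory is a list of (state, reward) pairs; both pass through "B".
traj_1 = [("A", 1.0), ("B", 0.0), ("C", 0.0)]  # good first step, poor finish
traj_2 = [("D", 0.0), ("B", 5.0), ("E", 5.0)]  # poor first step, good finish

rtg_1 = returns_to_go([r for _, r in traj_1])
rtg_2 = returns_to_go([r for _, r in traj_2])

# Return label that a return-conditioned model sees at "A" (from traj_1 only).
sampled_return_from_A = rtg_1[0]

# Best return from "B" over both trajectories, then stitched onto the step from "A".
best_from_B = max(rtg_1[1], rtg_2[1])
stitched_return_from_A = traj_1[0][1] + best_from_B

print(sampled_return_from_A, stitched_return_from_A)
```

In this toy data the return recorded at "A" is far below what stitching the first step of one trajectory onto the tail of the other would give, which is the kind of gap the summary describes.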

Summary (original)

As a data-driven paradigm, offline reinforcement learning (Offline RL) has been formulated as sequence modeling, where the Decision Transformer (DT) has demonstrated exceptional capabilities. Unlike previous reinforcement learning methods that fit value functions or compute policy gradients, DT adjusts the autoregressive model based on the expected returns, past states, and actions, using a causally masked Transformer to output the optimal action. However, due to the inconsistency between the sampled returns within a single trajectory and the optimal returns across multiple trajectories, it is challenging to set an expected return to output the optimal action and stitch together suboptimal trajectories. Decision ConvFormer (DC) is easier to understand in the context of modeling RL trajectories within a Markov Decision Process compared to DT. We propose the Q-value Regularized Decision ConvFormer (QDC), which combines the understanding of RL trajectories by DC and incorporates a term that maximizes action values using dynamic programming methods during training. This ensures that the expected returns of the sampled actions are consistent with the optimal returns. QDC achieves excellent performance on the D4RL benchmark, outperforming or approaching the optimal level in all tested environments. It particularly demonstrates outstanding competitiveness in trajectory stitching capability.
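
The abstract does not state the exact training objective, so the following is only a minimal sketch of what a Q-value regularized loss could look like, assuming a common form: a behavior-cloning error on dataset actions minus a weighted action-value term. The critic `toy_q`, the weight `eta`, and the scalar states and actions are all hypothetical; in practice the critic would be learned with Bellman-style (dynamic programming) updates, which are not shown here.

```python
# Minimal sketch under stated assumptions; not the paper's implementation.

def toy_q(state, action):
    """Hypothetical critic: prefers actions close to 2 * state."""
    return -(action - 2.0 * state) ** 2

def q_regularized_loss(states, pred_actions, data_actions, q_fn, eta):
    """Mean squared imitation error minus eta times the mean Q-value
    of the predicted actions (assumed form of the regularized loss)."""
    n = len(states)
    bc_term = sum((p - a) ** 2 for p, a in zip(pred_actions, data_actions)) / n
    q_term = sum(q_fn(s, p) for s, p in zip(states, pred_actions)) / n
    return bc_term - eta * q_term

states = [1.0, 2.0]
data_actions = [1.0, 2.0]  # dataset actions, suboptimal under toy_q

# Pure imitation: predict exactly the dataset actions.
imitate = q_regularized_loss(states, data_actions, data_actions, toy_q, eta=1.0)

# Predictions shifted toward what the critic prefers.
blended = q_regularized_loss(states, [1.5, 3.0], data_actions, toy_q, eta=1.0)

print(imitate, blended)
```

With these made-up numbers, predictions that move part of the way toward higher-value actions get a lower loss than pure imitation, which is the intended effect of adding the action-value term.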

arXiv information

Authors	Teng Yan, Zhendong Ruan, Yaobang Cai, Yu Han, Wenxian Li, Yang Zhang
Published	2024-09-12 14:10:22+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
