Reinformer: Max-Return Sequence Modeling for offline RL

Summary

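As a data-driven paradigm, offline reinforcement learning (RL) has been formulated as sequence modeling conditioned on hindsight information such as returns, goals, or future trajectories.
Promising as it is, this supervised paradigm overlooks the core objective of RL: maximizing the return.
That omission leads directly to a lack of trajectory stitching ability, which hampers sequence models that learn from sub-optimal data.
This work introduces max-return sequence modeling, which integrates the goal of maximizing returns into existing sequence models.
The authors propose the Reinforced Transformer (Reinformer), so named because the sequence model is reinforced by the RL objective.
During training, Reinformer additionally incorporates the return-maximization objective, aiming to predict the maximum future return that lies within the data distribution.
During inference, this in-distribution maximum return guides the selection of optimal actions.
Empirically, Reinformer is competitive with classical RL methods on the D4RL benchmark and outperforms state-of-the-art sequence models, particularly in trajectory stitching ability.
Code is public at \url{https://github.com/Dragon-Zhuang/Reinformer}.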
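
The idea of predicting an in-distribution maximum return can be illustrated with a small sketch. The summary does not say which loss is used to predict that maximum, so the expectile regression below is an assumption chosen for illustration, and the function names (`returns_to_go`, `expectile_loss`, `fit_expectile`) and the toy numbers are hypothetical, not taken from the released code. With an expectile level `tau` close to 1, the fitted value moves toward the largest return seen in the data without exceeding it.

```python
def returns_to_go(rewards, gamma=1.0):
    """Sum of (optionally discounted) future rewards from each timestep onward."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


def expectile_loss(prediction, targets, tau):
    """Asymmetric squared error (assumed loss, for illustration only).

    Under-prediction (target > prediction) is weighted by tau,
    over-prediction by 1 - tau.
    """
    total = 0.0
    for g in targets:
        diff = g - prediction
        weight = tau if diff > 0 else 1.0 - tau
        total += weight * diff * diff
    return total / len(targets)


def fit_expectile(targets, tau, steps=2000, lr=0.05):
    """Minimise the expectile loss over a single scalar by gradient descent."""
    pred = sum(targets) / len(targets)  # start from the mean
    for _ in range(steps):
        grad = 0.0
        for g in targets:
            diff = g - pred
            weight = tau if diff > 0 else 1.0 - tau
            grad += -2.0 * weight * diff
        pred -= lr * grad / len(targets)
    return pred


if __name__ == "__main__":
    # Toy returns observed from the same state in three hypothetical trajectories.
    observed_returns = [1.0, 2.0, 10.0]
    print("returns-to-go:", returns_to_go([1.0, 0.0, 2.0]))
    print("tau=0.50:", fit_expectile(observed_returns, 0.50))  # close to the mean
    print("tau=0.99:", fit_expectile(observed_returns, 0.99))  # close to the max
```

In a full sequence model the scalar would be replaced by a network output conditioned on the state history, and the predicted return would then be fed back as the conditioning signal for action prediction at inference time. That wiring is described only at a high level in the summary and is not reproduced here.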

Summary (original)

As a data-driven paradigm, offline reinforcement learning (RL) has been formulated as sequence modeling that conditions on the hindsight information including returns, goal or future trajectory. Although promising, this supervised paradigm overlooks the core objective of RL that maximizes the return. This overlook directly leads to the lack of trajectory stitching capability that affects the sequence model learning from sub-optimal data. In this work, we introduce the concept of max-return sequence modeling which integrates the goal of maximizing returns into existing sequence models. We propose Reinforced Transformer (Reinformer), indicating the sequence model is reinforced by the RL objective. Reinformer additionally incorporates the objective of maximizing returns in the training phase, aiming to predict the maximum future return within the distribution. During inference, this in-distribution maximum return will guide the selection of optimal actions. Empirically, Reinformer is competitive with classical RL methods on the D4RL benchmark and outperforms state-of-the-art sequence model particularly in trajectory stitching ability. Code is public at \url{https://github.com/Dragon-Zhuang/Reinformer}.

arXiv information

Authors	Zifeng Zhuang, Dengyun Peng, Jinxin Liu, Ziqi Zhang, Donglin Wang
Published	2024-05-14 16:30:03+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
