TOP-ERL: Transformer-based Off-Policy Episodic Reinforcement Learning

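Summary

This work introduces Transformer-based Off-Policy Episodic Reinforcement Learning (TOP-ERL), a new algorithm that enables off-policy updates within the episodic reinforcement learning (ERL) framework.
In ERL, the policy predicts an entire action trajectory spanning multiple time steps rather than a single action at each step.
These trajectories are typically parameterized by trajectory generators such as Movement Primitives (MP), which allows smooth and efficient exploration over long horizons while capturing high-level temporal correlations.
However, ERL methods are often restricted to on-policy frameworks because state-action values are difficult to evaluate for entire action sequences; this limits sample efficiency and prevents the use of more efficient off-policy architectures.
TOP-ERL addresses this shortcoming by segmenting long action sequences and estimating the state-action value of each segment with a transformer-based critic combined with n-step return estimation.
These contributions yield efficient and stable training, as reflected in empirical results on sophisticated robot learning environments.
TOP-ERL is reported to significantly outperform state-of-the-art RL methods.
Thorough ablation studies further show how key design choices affect model performance.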
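
The two ingredients named in the summary, segmenting a long action sequence and estimating an n-step return, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `split_into_segments` and `n_step_return`, the fixed segment length, and the scalar `bootstrap_value` standing in for a critic's prediction are all assumptions made for this example.

```python
from typing import List, Sequence


def split_into_segments(actions: Sequence[float], segment_length: int) -> List[List[float]]:
    """Split one long action sequence into consecutive segments.

    Hypothetical helper: a fixed segment length is an assumption of this
    sketch. The last segment is shorter when the sequence length is not a
    multiple of segment_length.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    return [
        list(actions[i:i + segment_length])
        for i in range(0, len(actions), segment_length)
    ]


def n_step_return(rewards: Sequence[float], bootstrap_value: float, gamma: float) -> float:
    """Discounted sum of n rewards plus a discounted bootstrap value.

    G = r_0 + gamma * r_1 + ... + gamma^(n-1) * r_(n-1) + gamma^n * V

    Here V (bootstrap_value) is a plain number; in an actor-critic method it
    would come from a learned value estimate (an assumption of this sketch).
    """
    g = bootstrap_value
    # Accumulate backwards so each step applies one more factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


segments = split_into_segments(list(range(10)), segment_length=4)
target = n_step_return([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
```

With these inputs, `segments` holds three segments of lengths 4, 4, and 2, and `target` is the 3-step return for three unit rewards followed by a bootstrap value of 10.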

Abstract (original)

This work introduces Transformer-based Off-Policy Episodic Reinforcement Learning (TOP-ERL), a novel algorithm that enables off-policy updates in the ERL framework. In ERL, policies predict entire action trajectories over multiple time steps instead of single actions at every time step. These trajectories are typically parameterized by trajectory generators such as Movement Primitives (MP), allowing for smooth and efficient exploration over long horizons while capturing high-level temporal correlations. However, ERL methods are often constrained to on-policy frameworks due to the difficulty of evaluating state-action values for entire action sequences, limiting their sample efficiency and preventing the use of more efficient off-policy architectures. TOP-ERL addresses this shortcoming by segmenting long action sequences and estimating the state-action values for each segment using a transformer-based critic architecture alongside an n-step return estimation. These contributions result in efficient and stable training that is reflected in the empirical results conducted on sophisticated robot learning environments. TOP-ERL significantly outperforms state-of-the-art RL methods. Thorough ablation studies additionally show the impact of key design choices on the model performance.

arXiv information

Authors	Ge Li, Dong Tian, Hongyi Zhou, Xinkai Jiang, Rudolf Lioutikov, Gerhard Neumann
Published	2025-01-31 16:09:12+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
