Time-Efficient Reinforcement Learning with Stochastic Stateful Policies

Summary

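Stateful policies play an important role in reinforcement learning: they handle partially observable environments, improve robustness, and impose an inductive bias directly on the policy structure.
The conventional method for training stateful policies is Backpropagation Through Time (BPTT), which has significant drawbacks, such as slow training caused by sequential gradient propagation and vanishing or exploding gradients.
Gradients are often truncated to address these issues, which biases the policy update.
We propose a new approach for training stateful policies: decompose the stateful policy into a stochastic internal state kernel and a stateless policy, and optimize the two jointly by following the stateful policy gradient.
We introduce several versions of the stateful policy gradient theorem, which make it easy to instantiate stateful variants of popular reinforcement learning and imitation learning algorithms.
We also provide a theoretical analysis of the new gradient estimator and compare it with BPTT.
We evaluate the approach on complex continuous control tasks, such as humanoid locomotion, and show that the gradient estimator scales effectively with task complexity while offering a faster and simpler alternative to BPTT.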

Abstract (original)

Stateful policies play an important role in reinforcement learning, such as handling partially observable environments, enhancing robustness, or imposing an inductive bias directly into the policy structure. The conventional method for training stateful policies is Backpropagation Through Time (BPTT), which comes with significant drawbacks, such as slow training due to sequential gradient propagation and the occurrence of vanishing or exploding gradients. The gradient is often truncated to address these issues, resulting in a biased policy update. We present a novel approach for training stateful policies by decomposing the latter into a stochastic internal state kernel and a stateless policy, jointly optimized by following the stateful policy gradient. We introduce different versions of the stateful policy gradient theorem, enabling us to easily instantiate stateful variants of popular reinforcement learning and imitation learning algorithms. Furthermore, we provide a theoretical analysis of our new gradient estimator and compare it with BPTT. We evaluate our approach on complex continuous control tasks, e.g., humanoid locomotion, and demonstrate that our gradient estimator scales effectively with task complexity while offering a faster and simpler alternative to BPTT.
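
The decomposition described in the abstract can be illustrated with a toy sketch. This is one possible reading of the idea, not the authors' implementation: a linear-Gaussian policy that samples the action `a` and the next internal state `z_next` jointly from the observation `s` and the current internal state `z`. Because `z` and `z_next` are treated as sampled values, the gradient of each step's log-probability depends only on that step, so nothing is propagated backward through time. All names (`SIGMA`, `mean`, `sample_step`, `log_prob`, `score`, `rollout_scores`), the scalar dimensions, and the fixed noise level are assumptions made for illustration.

```python
import math
import random

SIGMA = 0.5  # fixed noise scale for both outputs (an assumption of this sketch)


def mean(theta, s, z):
    """Linear mean of the joint output (a, z_next) given the input (s, z)."""
    x = (s, z)
    return [sum(theta[i][j] * x[j] for j in range(2)) for i in range(2)]


def sample_step(theta, s, z, rng):
    """Sample an action and the next internal state from the joint policy."""
    mu = mean(theta, s, z)
    return rng.gauss(mu[0], SIGMA), rng.gauss(mu[1], SIGMA)


def log_prob(theta, s, z, a, z_next):
    """Log-density of the sampled pair (a, z_next) under the joint policy."""
    mu = mean(theta, s, z)
    y = (a, z_next)
    const = -math.log(SIGMA) - 0.5 * math.log(2.0 * math.pi)
    return sum(-0.5 * ((y[i] - mu[i]) / SIGMA) ** 2 + const for i in range(2))


def score(theta, s, z, a, z_next):
    """Gradient of log_prob with respect to theta for a single step.

    z and z_next are plain numbers here, so this gradient involves no
    terms from earlier or later time steps.
    """
    mu = mean(theta, s, z)
    x = (s, z)
    y = (a, z_next)
    return [[(y[i] - mu[i]) / SIGMA ** 2 * x[j] for j in range(2)]
            for i in range(2)]


def rollout_scores(theta, observations, rng, z0=0.0):
    """Roll the policy over a sequence of observations, one score per step."""
    z = z0
    scores = []
    for s in observations:
        a, z_next = sample_step(theta, s, z, rng)
        scores.append(score(theta, s, z, a, z_next))
        z = z_next  # the sampled internal state is carried forward
    return scores


theta = [[0.3, -0.2], [0.1, 0.4]]
per_step = rollout_scores(theta, [0.5, -1.0, 0.25], random.Random(0))
print(len(per_step))  # one 2x2 score matrix per time step
```

In a score-function policy gradient method, each per-step score would be weighted by a return or advantage estimate and summed over the trajectory; that weighting and the environment are omitted here.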

arXiv information

Authors	Firas Al-Hafez, Guoping Zhao, Jan Peters, Davide Tateo
Published	2023-11-07 15:48:07+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
