Deep Policy Gradient Methods Without Batch Updates, Target Networks, or Replay Buffers

Summary

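Modern deep policy gradient methods perform well on simulated robotic tasks, but all of them need large replay buffers, expensive batch updates, or both, which makes them incompatible with real systems that run on resource-limited computers.
The authors show that these methods fail catastrophically when restricted to small replay buffers, or when learning incrementally, that is, when each update uses only the most recent sample with no batch update and no replay buffer.
They propose Action Value Gradient (AVG), a new incremental deep policy gradient method, together with a set of normalization and scaling techniques that address the instability of incremental learning.
On robotic simulation benchmarks, AVG is the only incremental method that learns effectively, and its final performance is often comparable to that of batch policy gradient methods.
This advance allowed the authors to demonstrate, for the first time, effective deep reinforcement learning on real robots using only incremental updates, on both a robotic manipulator and a mobile robot.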

Summary (original)

Modern deep policy gradient methods achieve effective performance on simulated robotic tasks, but they all require large replay buffers or expensive batch updates, or both, making them incompatible for real systems with resource-limited computers. We show that these methods fail catastrophically when limited to small replay buffers or during incremental learning, where updates only use the most recent sample without batch updates or a replay buffer. We propose a novel incremental deep policy gradient method — Action Value Gradient (AVG) and a set of normalization and scaling techniques to address the challenges of instability in incremental learning. On robotic simulation benchmarks, we show that AVG is the only incremental method that learns effectively, often achieving final performance comparable to batch policy gradient methods. This advancement enabled us to show for the first time effective deep reinforcement learning with real robots using only incremental updates, employing a robotic manipulator and a mobile robot.
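The abstract mentions normalization in a setting where each update sees only the most recent sample, but it does not spell out the techniques. As a rough illustration of what sample-by-sample normalization can look like, the sketch below keeps running statistics with Welford's algorithm. This is a generic technique chosen for illustration, not necessarily what AVG uses; the class name `RunningNormalizer` and the `eps` parameter are hypothetical.

```python
import math


class RunningNormalizer:
    """Running mean/variance of a scalar stream via Welford's algorithm.

    Illustrative only: not taken from the AVG paper.
    """

    def __init__(self, eps=1e-8):
        self.count = 0      # number of samples seen so far
        self.mean = 0.0     # running mean
        self.m2 = 0.0       # running sum of squared deviations from the mean
        self.eps = eps      # small constant to avoid division by zero

    def update(self, x):
        """Fold one new sample into the statistics; no past samples are stored."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Population variance; defaults to 1.0 until two samples have been seen."""
        return self.m2 / self.count if self.count > 1 else 1.0

    def normalize(self, x):
        """Scale a sample using the statistics accumulated so far."""
        return (x - self.mean) / math.sqrt(self.variance() + self.eps)
```

Each call to `update` uses constant memory and constant time, which is the property that matters when no replay buffer or batch is available.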

arXiv information

Authors	Gautham Vasan, Mohamed Elsayed, Alireza Azimi, Jiamin He, Fahim Shariar, Colin Bellinger, Martha White, A. Rupam Mahmood
Published	2025-05-21 05:30:43+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
