DiffTOP: Differentiable Trajectory Optimization for Deep Reinforcement and Imitation Learning

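Abstract

This paper introduces DiffTOP, which uses differentiable trajectory optimization as the policy representation for generating actions in deep reinforcement learning and imitation learning.
Trajectory optimization is a powerful, widely used control algorithm parameterized by a cost function and a dynamics function.
The key to the approach is to build on recent progress in differentiable trajectory optimization, which makes it possible to compute gradients of a loss with respect to the parameters of the trajectory optimizer.
As a result, the cost and dynamics functions of trajectory optimization can be learned end-to-end.
DiffTOP addresses the "objective mismatch" problem of earlier model-based RL algorithms, because its dynamics model is trained to maximize task performance directly by differentiating the policy gradient loss through the trajectory optimization process.
The authors also benchmark DiffTOP for imitation learning on standard robotic manipulation task suites with high-dimensional sensory observations, comparing it against feed-forward policy classes, energy-based models (EBM), and diffusion.
Across 15 model-based RL tasks and 13 imitation learning tasks with high-dimensional image and point cloud inputs, DiffTOP outperforms prior state-of-the-art methods in both domains.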

Abstract (original)

This paper introduces DiffTOP, which utilizes Differentiable Trajectory OPtimization as the policy representation to generate actions for deep reinforcement and imitation learning. Trajectory optimization is a powerful and widely used algorithm in control, parameterized by a cost and a dynamics function. The key to our approach is to leverage the recent progress in differentiable trajectory optimization, which enables computing the gradients of the loss with respect to the parameters of trajectory optimization. As a result, the cost and dynamics functions of trajectory optimization can be learned end-to-end. DiffTOP addresses the “objective mismatch” issue of prior model-based RL algorithms, as the dynamics model in DiffTOP is learned to directly maximize task performance by differentiating the policy gradient loss through the trajectory optimization process. We further benchmark DiffTOP for imitation learning on standard robotic manipulation task suites with high-dimensional sensory observations and compare our method to feed-forward policy classes as well as Energy-Based Models (EBM) and Diffusion. Across 15 model-based RL tasks and 13 imitation learning tasks with high-dimensional image and point cloud inputs, DiffTOP outperforms prior state-of-the-art methods in both domains.
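
The idea of passing a loss gradient through a trajectory optimizer into its cost and dynamics parameters can be illustrated with a toy problem. The sketch below is not the paper's implementation: it assumes scalar linear dynamics `x_next = a * x + b * u` and a quadratic cost `sum_t (x_t - goal)^2 + r * u_t^2`, so that the inner optimization has a closed-form solution, whereas DiffTOP learns general cost and dynamics functions. The function names (`rollout_terms`, `plan`, `imitation_loss_and_grads`) and the parameters `goal`, `b`, `a`, `r` are hypothetical. The outer gradients are obtained with the implicit function theorem applied at the optimal action sequence, and the loss is a simple imitation loss on the first planned action.

```python
import numpy as np


def rollout_terms(x0, a, horizon):
    """Return (A, c) such that the states x_1..x_H equal c + b * A @ u."""
    powers = a ** np.arange(horizon + 1)      # a^0 .. a^H
    A = np.zeros((horizon, horizon))
    for t in range(horizon):
        A[t, : t + 1] = powers[t::-1]         # A[t, k] = a^(t-k) for k <= t
    c = powers[1:] * x0                       # free response a^(t+1) * x0
    return A, c


def plan(x0, goal, b, a=1.0, r=0.1, horizon=5):
    """Inner problem: minimise sum_t (x_t - goal)^2 + r * u_t^2 over actions u."""
    A, c = rollout_terms(x0, a, horizon)
    # Hessian of the inner objective with respect to the action sequence.
    hess = 2.0 * (b * b * A.T @ A + r * np.eye(horizon))
    # Stationarity condition grad_u J = 0 is linear in u for this toy problem.
    u_star = np.linalg.solve(hess, 2.0 * b * A.T @ (goal - c))
    return u_star, A, c, hess


def imitation_loss_and_grads(x0, u_expert, goal, b, **kwargs):
    """Loss on the first planned action, plus gradients w.r.t. goal and b."""
    u_star, A, c, hess = plan(x0, goal, b, **kwargs)
    resid = c + b * A @ u_star - goal         # predicted states minus goal
    # Derivatives of grad_u J with respect to each learnable parameter.
    cross_goal = -2.0 * b * A.T @ np.ones_like(u_star)
    cross_b = 2.0 * A.T @ resid + 2.0 * b * A.T @ (A @ u_star)
    # Implicit function theorem: du*/dtheta = -hess^{-1} @ cross_theta.
    du_dgoal = -np.linalg.solve(hess, cross_goal)
    du_db = -np.linalg.solve(hess, cross_b)
    err = u_star[0] - u_expert                # only the first action is compared
    loss = err ** 2
    return loss, 2.0 * err * du_dgoal[0], 2.0 * err * du_db[0]


if __name__ == "__main__":
    # One illustrative gradient step on the cost parameter (goal)
    # and the dynamics parameter (b); all values are made up.
    goal, b, lr = 0.0, 1.0, 0.05
    loss, g_goal, g_b = imitation_loss_and_grads(x0=0.3, u_expert=0.7, goal=goal, b=b)
    goal, b = goal - lr * g_goal, b - lr * g_b
    print(loss, goal, b)
```

In this sketch the cost parameter `goal` and the dynamics parameter `b` both receive gradients from the same task loss, which is the sense in which the model is trained for task performance rather than for prediction accuracy. A nonlinear cost or dynamics model would need an iterative inner solver, but the same implicit-differentiation step would apply at its solution.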

arXiv information

Authors	Weikang Wan, Yufei Wang, Zackory Erickson, David Held
Published	2024-02-08 05:26:40+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
