TDMPBC: Self-Imitative Reinforcement Learning for Humanoid Robot Control

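Summary

Complex high-dimensional spaces with high degrees of freedom and complicated action spaces, such as humanoid robots equipped with dexterous hands, pose significant challenges for reinforcement learning (RL) algorithms, which must balance exploration and exploitation under limited sample budgets.
In general, the feasible regions for accomplishing tasks within such spaces are exceedingly narrow.
For instance, in humanoid robot motion control, the vast majority of the space corresponds to falling, while only a minuscule fraction corresponds to standing upright, the state that is conducive to completing downstream tasks.
Once the robot explores into a potentially task-relevant region, it should place greater emphasis on the data within that region.
Building on this insight, we propose the $\textbf{S}$elf-$\textbf{I}$mitative $\textbf{R}$einforcement $\textbf{L}$earning ($\textbf{SIRL}$) framework, in which the RL algorithm also imitates potentially task-relevant trajectories.
Specifically, the return of a trajectory is used to determine its relevance to the task, and an additional behavior cloning term is adopted whose weight is dynamically adjusted based on that return.
As a result, the proposed algorithm achieves a 120% performance improvement on the challenging HumanoidBench with 5% extra computation overhead.
Further visualization shows that the performance gain corresponds to meaningful behavior improvement, with several tasks solved successfully.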

Summary (original)

Complex high-dimensional spaces with high Degree-of-Freedom and complicated action spaces, such as humanoid robots equipped with dexterous hands, pose significant challenges for reinforcement learning (RL) algorithms, which need to wisely balance exploration and exploitation under limited sample budgets. In general, feasible regions for accomplishing tasks within complex high-dimensional spaces are exceedingly narrow. For instance, in the context of humanoid robot motion control, the vast majority of space corresponds to falling, while only a minuscule fraction corresponds to standing upright, which is conducive to the completion of downstream tasks. Once the robot explores into a potentially task-relevant region, it should place greater emphasis on the data within that region. Building on this insight, we propose the $\textbf{S}$elf-$\textbf{I}$mitative $\textbf{R}$einforcement $\textbf{L}$earning ($\textbf{SIRL}$) framework, where the RL algorithm also imitates potentially task-relevant trajectories. Specifically, trajectory return is utilized to determine its relevance to the task and an additional behavior cloning is adopted whose weight is dynamically adjusted based on the trajectory return. As a result, our proposed algorithm achieves 120% performance improvement on the challenging HumanoidBench with 5% extra computation overhead. With further visualization, we find the significant performance gain does lead to meaningful behavior improvement that several tasks are solved successfully.
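The return-weighted behavior cloning idea described in the abstract can be illustrated with a minimal sketch. The abstract only states that the behavior cloning weight is adjusted dynamically based on trajectory return; the min-max normalization, the `max_weight` parameter, and all function names below are assumptions made for illustration, not the authors' implementation.

```python
def trajectory_return(rewards, gamma=1.0):
    """Discounted return of one trajectory (gamma=1.0 gives the plain sum)."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def bc_weights(returns, max_weight=1.0):
    """Map trajectory returns to behavior cloning weights in [0, max_weight].

    Assumption: min-max normalization over the batch, so the highest-return
    trajectory is imitated most strongly and the lowest not at all.
    """
    lowest, highest = min(returns), max(returns)
    if highest == lowest:
        # No trajectory stands out, so none gets extra imitation weight
        # (an illustrative choice, not taken from the source).
        return [0.0] * len(returns)
    scale = max_weight / (highest - lowest)
    return [(g - lowest) * scale for g in returns]


def self_imitative_loss(rl_loss, bc_errors, weights):
    """Add the return-weighted mean behavior cloning error to the RL loss.

    bc_errors holds one imitation error per trajectory, for example the
    squared distance between the policy's actions and the stored actions.
    """
    if len(bc_errors) != len(weights):
        raise ValueError("bc_errors and weights must have the same length")
    weighted = sum(w * e for w, e in zip(weights, bc_errors))
    return rl_loss + weighted / len(bc_errors)


if __name__ == "__main__":
    # Three hypothetical trajectories: mostly falling, partly upright, upright.
    reward_lists = ([0, 0, 1], [1, 1, 1], [2, 2, 2])
    returns = [trajectory_return(r) for r in reward_lists]
    weights = bc_weights(returns)
    print(weights)
    print(self_imitative_loss(1.0, [0.5, 0.5, 0.5], weights))
```

In this sketch, trajectories that reach higher-return regions contribute more to the imitation term, which mirrors the abstract's point that data from potentially task-relevant regions should receive greater emphasis.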

arXiv information

Authors	Zifeng Zhuang, Diyuan Shi, Runze Suo, Xiao He, Hongyin Zhang, Ting Wang, Shangke Lyu, Donglin Wang
Published	2025-02-24 16:55:27+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
