LeVERB: Humanoid Whole-Body Control with Latent Vision-Language Instruction

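Summary

Vision-language-action (VLA) models have demonstrated strong semantic understanding and zero-shot generalization, but most existing systems assume an accurate low-level controller with a hand-crafted action "vocabulary" such as end-effector pose or root velocity.
This assumption confines prior work to quasi-static tasks and rules out the agile, whole-body behaviors that humanoid whole-body control (WBC) tasks require.
To address this gap in the literature, we first introduce the first sim-to-real-ready, vision-language, closed-loop benchmark for humanoid WBC, comprising over 150 tasks from 10 categories.
We then propose LeVERB (Latent Vision-Language-Encoded Robot Behavior), a hierarchical latent instruction-following framework for humanoid vision-language WBC and the first of its kind.
At the top level, a vision-language policy learns a latent action vocabulary from synthetically rendered kinematic demonstrations.
At the low level, a reinforcement-learned WBC policy consumes these latent verbs to generate dynamics-level commands.
On our benchmark, LeVERB attains an 80% zero-shot success rate on simple visual navigation tasks and a 58.5% success rate overall, outperforming a naive hierarchical whole-body VLA implementation by 7.8 times.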
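
The two-level structure described in the summary can be illustrated with a toy control loop. This is only a minimal sketch and not the authors' implementation: the function names, the latent size, the joint count, the update period, and the hand-written arithmetic standing in for the learned policies are all hypothetical. In particular, the summary does not say how often the vision-language policy runs; the sketch simply assumes it refreshes the latent less often than the whole-body controller acts.

```python
import math
from typing import List, Tuple

# All constants below are hypothetical, chosen only for illustration.
LATENT_DIM = 4          # size of the latent "verb" passed between levels
NUM_JOINTS = 6          # number of joints driven by the low-level policy
HIGH_LEVEL_PERIOD = 5   # low-level steps per latent update (assumption)


def high_level_policy(image_features: List[float], instruction: str) -> List[float]:
    """Stand-in for the vision-language policy: observation + text -> latent verb."""
    # Deterministic toy encoding of the instruction text.
    text_code = (sum(ord(c) for c in instruction) % 97) / 97.0
    mean_feature = sum(image_features) / len(image_features)
    # tanh keeps every latent component inside (-1, 1).
    return [math.tanh(text_code + mean_feature * (i + 1)) for i in range(LATENT_DIM)]


def low_level_policy(latent: List[float], proprio: List[float]) -> List[float]:
    """Stand-in for the RL-trained WBC policy: latent verb + joint state -> commands."""
    return [-0.5 * q + latent[j % len(latent)] for j, q in enumerate(proprio)]


def run_episode(instruction: str, steps: int = 20) -> Tuple[List[List[float]], int]:
    """Run the hierarchy in closed loop; return all commands and the latent update count."""
    proprio = [0.0] * NUM_JOINTS
    latent: List[float] = [0.0] * LATENT_DIM
    latent_updates = 0
    commands = []
    for t in range(steps):
        if t % HIGH_LEVEL_PERIOD == 0:
            image_features = [0.1 * t] * 8  # placeholder for camera features
            latent = high_level_policy(image_features, instruction)
            latent_updates += 1
        command = low_level_policy(latent, proprio)
        # Toy integrator standing in for the robot dynamics.
        proprio = [q + 0.1 * u for q, u in zip(proprio, command)]
        commands.append(command)
    return commands, latent_updates


if __name__ == "__main__":
    cmds, updates = run_episode("walk to the chair")
    print(f"{len(cmds)} low-level commands, {updates} latent updates")
```

The point of the sketch is the interface: the low-level controller never sees the image or the instruction, only the latent vector, which is the role the summary assigns to the learned latent action vocabulary.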

Abstract (original)

Vision-language-action (VLA) models have demonstrated strong semantic understanding and zero-shot generalization, yet most existing systems assume an accurate low-level controller with hand-crafted action ‘vocabulary’ such as end-effector pose or root velocity. This assumption confines prior work to quasi-static tasks and precludes the agile, whole-body behaviors required by humanoid whole-body control (WBC) tasks. To capture this gap in the literature, we start by introducing the first sim-to-real-ready, vision-language, closed-loop benchmark for humanoid WBC, comprising over 150 tasks from 10 categories. We then propose LeVERB: Latent Vision-Language-Encoded Robot Behavior, a hierarchical latent instruction-following framework for humanoid vision-language WBC, the first of its kind. At the top level, a vision-language policy learns a latent action vocabulary from synthetically rendered kinematic demonstrations; at the low level, a reinforcement-learned WBC policy consumes these latent verbs to generate dynamics-level commands. In our benchmark, LeVERB can zero-shot attain a 80% success rate on simple visual navigation tasks, and 58.5% success rate overall, outperforming naive hierarchical whole-body VLA implementation by 7.8 times.

arXiv information

Authors	Haoru Xue, Xiaoyu Huang, Dantong Niu, Qiayuan Liao, Thomas Kragerud, Jan Tommy Gravdahl, Xue Bin Peng, Guanya Shi, Trevor Darrell, Koushil Sreenath, Shankar Sastry
Published	2025-06-16 17:56:53+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
