CrayonRobo: Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation

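Abstract

In robotics, task goals can be conveyed through various modalities, such as language, goal images, and goal videos.
However, natural language can be ambiguous, while images and videos may specify the goal in more detail than is needed.
To address these challenges, we introduce CrayonRobo, which leverages comprehensive multi-modal prompts that explicitly convey both low-level actions and high-level planning in a simple manner.
Specifically, for each key-frame in the task sequence, the method allows simple yet expressive 2D visual prompts, overlaid on RGB images, to be generated manually or automatically.
These prompts represent the required task goals, such as the end-effector pose and the desired movement direction after contact.
We develop a training strategy that enables the model to interpret these visual-language prompts and predict the corresponding contact poses and movement directions in SE(3) space.
Furthermore, by executing all key-frame steps in sequence, the model can complete long-horizon tasks.
This approach not only helps the model explicitly understand the task objectives but also improves its robustness on unseen tasks by providing easily interpretable prompts.
We evaluate the method in both simulated and real-world environments and demonstrate its robust manipulation capabilities.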
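
The key-frame loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `KeyFrame`, `Action`, `run_task`, `predict`, and `execute` are hypothetical, the model is replaced by a stub, and the example instructions are invented. It only shows the control flow of predicting a contact pose and a post-contact movement direction for each prompted key-frame and executing the steps in order.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


@dataclass
class KeyFrame:
    """One task step (hypothetical container, not from the paper)."""
    prompted_image: object  # stand-in for an RGB image with a 2D visual prompt drawn on it
    instruction: str        # accompanying language prompt


@dataclass
class Action:
    """Assumed output format: an SE(3) contact pose plus a movement direction."""
    contact_position: Vec3  # translation part of the contact pose
    contact_rotation: Mat3  # rotation part of the contact pose, row-major 3x3
    move_direction: Vec3    # direction of motion after contact


def normalize(v: Vec3) -> Vec3:
    """Scale a 3D vector to unit length."""
    norm = sum(c * c for c in v) ** 0.5
    if norm == 0.0:
        raise ValueError("movement direction must be non-zero")
    return tuple(c / norm for c in v)


def run_task(
    keyframes: List[KeyFrame],
    predict: Callable[[KeyFrame], Action],
    execute: Callable[[Action], None],
) -> List[Action]:
    """Run every key-frame step in order, as in a long-horizon task."""
    executed = []
    for keyframe in keyframes:
        action = predict(keyframe)  # the model call would go here
        action.move_direction = normalize(action.move_direction)
        execute(action)             # the robot controller call would go here
        executed.append(action)
    return executed


IDENTITY: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))


def stub_predict(keyframe: KeyFrame) -> Action:
    """Stub standing in for the trained model; returns a fixed action."""
    return Action((0.1, 0.2, 0.3), IDENTITY, (0.0, 0.0, 2.0))


if __name__ == "__main__":
    frames = [KeyFrame(None, "open the drawer"), KeyFrame(None, "place the cup inside")]
    for step in run_task(frames, stub_predict, execute=lambda a: None):
        print(step.contact_position, step.move_direction)
```

In a real system, `predict` would wrap the trained vision-language-action model and `execute` would wrap the robot controller; both are left as callables here so the loop itself stays independent of any particular model or robot.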

Abstract (original)

In robotics, task goals can be conveyed through various modalities, such as language, goal images, and goal videos. However, natural language can be ambiguous, while images or videos may offer overly detailed specifications. To tackle these challenges, we introduce CrayonRobo, which leverages comprehensive multi-modal prompts that explicitly convey both low-level actions and high-level planning in a simple manner. Specifically, for each key-frame in the task sequence, our method allows for manual or automatic generation of simple and expressive 2D visual prompts overlaid on RGB images. These prompts represent the required task goals, such as the end-effector pose and the desired movement direction after contact. We develop a training strategy that enables the model to interpret these visual-language prompts and predict the corresponding contact poses and movement directions in SE(3) space. Furthermore, by sequentially executing all key-frame steps, the model can complete long-horizon tasks. This approach not only helps the model explicitly understand the task objectives but also enhances its robustness on unseen tasks by providing easily interpretable prompts. We evaluate our method in both simulated and real-world environments, demonstrating its robust manipulation capabilities.

arXiv information

Authors	Xiaoqi Li, Lingyun Xu, Mingxu Zhang, Jiaming Liu, Yan Shen, Iaroslav Ponomarenko, Jiahui Xu, Liang Heng, Siyuan Huang, Shanghang Zhang, Hao Dong
Published	2025-05-04 15:58:14+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
