TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

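Abstract

Large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robot learning, but they still struggle with the spatial-temporal dynamics of interactive robotics, which makes them less effective on complex tasks such as manipulation.
In this work, we introduce visual trace prompting, a simple yet effective approach that improves a VLA model's spatial-temporal awareness for action prediction by encoding state-action trajectories visually.
We develop a new model, TraceVLA, by finetuning OpenVLA with visual trace prompting on our own collected dataset of 150K robot manipulation trajectories.
Evaluated across 137 configurations in SimplerEnv and 4 tasks on a physical WidowX robot, TraceVLA achieves state-of-the-art performance, outperforming OpenVLA by 10% on SimplerEnv and by 3.5x on real-robot tasks, and it generalizes robustly across diverse embodiments and scenarios.
To further validate the effectiveness and generality of our method, we present a compact VLA model based on the 4B Phi-3-Vision, pretrained on Open-X-Embodiment and finetuned on our dataset, which rivals the 7B OpenVLA baseline while significantly improving inference efficiency.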

Abstract (original)

Although large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robotic learning, they still struggle with spatial-temporal dynamics in interactive robotics, making them less effective in handling complex tasks, such as manipulation. In this work, we introduce visual trace prompting, a simple yet effective approach to facilitate VLA models’ spatial-temporal awareness for action prediction by encoding state-action trajectories visually. We develop a new TraceVLA model by finetuning OpenVLA on our own collected dataset of 150K robot manipulation trajectories using visual trace prompting. Evaluations of TraceVLA across 137 configurations in SimplerEnv and 4 tasks on a physical WidowX robot demonstrate state-of-the-art performance, outperforming OpenVLA by 10% on SimplerEnv and 3.5x on real-robot tasks and exhibiting robust generalization across diverse embodiments and scenarios. To further validate the effectiveness and generality of our method, we present a compact VLA model based on 4B Phi-3-Vision, pretrained on the Open-X-Embodiment and finetuned on our dataset, rivals the 7B OpenVLA baseline while significantly improving inference efficiency.
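
The abstract describes visual trace prompting only as "encoding state-action trajectories visually" and does not say how the traces are obtained or rendered. As a rough illustration of the general idea, and not the authors' implementation, the sketch below assumes a trajectory is already available as a list of 2D pixel coordinates and draws it onto a copy of the current camera frame. The function name `overlay_trace`, its parameters, and the sample coordinates are all hypothetical.

```python
import numpy as np


def overlay_trace(image, points, color=(255, 0, 0), radius=1):
    """Return a copy of `image` with the polyline through `points` drawn on it.

    Hypothetical helper, not taken from the paper.
    image  : (H, W, 3) uint8 array, the current observation
    points : sequence of at least two (x, y) pixel coordinates, oldest first
    """
    out = image.copy()
    h, w = out.shape[:2]
    pts = np.asarray(points, dtype=float)
    for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]):
        # Sample enough positions along the segment to leave no gaps.
        n = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
        xs = np.rint(np.linspace(x0, x1, n)).astype(int)
        ys = np.rint(np.linspace(y0, y1, n)).astype(int)
        for x, y in zip(xs, ys):
            if not (0 <= x < w and 0 <= y < h):
                continue  # ignore samples that fall outside the frame
            # Paint a small square around the sample, clipped to the frame.
            y_lo, y_hi = max(y - radius, 0), min(y + radius + 1, h)
            x_lo, x_hi = max(x - radius, 0), min(x + radius + 1, w)
            out[y_lo:y_hi, x_lo:x_hi] = color
    return out


# Example: a blank 64x64 frame and an assumed four-point trajectory.
frame = np.zeros((64, 64, 3), dtype=np.uint8)
trace = [(10, 50), (20, 40), (35, 38), (50, 20)]
prompted = overlay_trace(frame, trace)
```

The returned image has the same shape as the input, so under this assumption it could be passed to a vision-language model wherever the raw frame would otherwise go, with the drawn path carrying the recent motion history.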

arXiv information

Authors	Ruijie Zheng,Yongyuan Liang,Shuaiyi Huang,Jianfeng Gao,Hal Daumé III,Andrey Kolobov,Furong Huang,Jianwei Yang
Published	2024-12-13 18:40:51+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
