$π_0$: A Vision-Language-Action Flow Model for General Robot Control

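Abstract

Robot learning holds great promise for unlocking the full potential of flexible, general, and dexterous robot systems, and for addressing some of the deepest questions in artificial intelligence.
However, bringing robot learning to the level of generality that effective real-world systems require means overcoming major obstacles in data, generalization, and robustness.
In this paper, we discuss how generalist robot policies (that is, robot foundation models) can address these challenges, and how effective generalist robot policies can be designed for complex, highly dexterous tasks.
We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) so that it inherits Internet-scale semantic knowledge.
We then discuss how to train this model on a large, diverse dataset drawn from multiple dexterous robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators.
We evaluate the model on its ability to perform tasks zero-shot after pre-training, to follow language instructions from people and from a high-level VLM policy, and to acquire new skills through fine-tuning.
Our results cover a wide variety of tasks, such as laundry folding, table cleaning, and box assembly.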

Abstract (original)

Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We then discuss how this model can be trained on a large and diverse dataset from multiple dexterous robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators. We evaluate our model in terms of its ability to perform tasks in zero shot after pre-training, follow language instructions from people and from a high-level VLM policy, and its ability to acquire new skills via fine-tuning. Our results cover a wide variety of tasks, such as laundry folding, table cleaning, and assembling boxes.

arXiv information

Authors	Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, Ury Zhilinsky
Published	2024-11-13 17:30:10+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
