$π_0$: A Vision-Language-Action Flow Model for General Robot Control

Abstract

Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We then discuss how this model can be trained on a large and diverse dataset from multiple dexterous robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators. We evaluate our model in terms of its ability to perform tasks in zero shot after pre-training, follow language instructions from people and from a high-level VLM policy, and its ability to acquire new skills via fine-tuning. Our results cover a wide variety of tasks, such as laundry folding, table cleaning, and assembling boxes.

arXiv information

Authors	Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, Ury Zhilinsky
Published	2024-11-02 04:00:56+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
