$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Abstract

In order for robots to be useful, they must perform practically relevant tasks in the real world, outside of the lab. While vision-language-action (VLA) models have demonstrated impressive results for end-to-end robot control, it remains an open question how far such models can generalize in the wild. We describe $\pi_{0.5}$, a new model based on $\pi_{0}$ that uses co-training on heterogeneous tasks to enable broad generalization. $\pi_{0.5}$ uses data from multiple robots, high-level semantic prediction, web data, and other sources to enable broadly generalizable real-world robotic manipulation. Our system uses a combination of co-training and hybrid multi-modal examples that combine image observations, language commands, object detections, semantic subtask prediction, and low-level actions. Our experiments show that this kind of knowledge transfer is essential for effective generalization, and we demonstrate for the first time that an end-to-end learning-enabled robotic system can perform long-horizon and dexterous manipulation skills, such as cleaning a kitchen or bedroom, in entirely new homes.

arXiv information

Authors	Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, Lucy Xiaoyang Shi, Laura Smith, Jost Tobias Springenberg, Kyle Stachowicz, James Tanner, Quan Vuong, Homer Walke, Anna Walling, Haohuan Wang, Lili Yu, Ury Zhilinsky
Published	2025-04-22 17:31:29+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
