Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge

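Summary

Vision-language-action (VLA) models have emerged as the next generation of models in robotics.
However, despite building on powerful pre-trained vision-language models (VLMs), existing end-to-end VLA systems often lose key capabilities during fine-tuning as the model adapts to specific robotic tasks.
We argue that a generalizable VLA model should retain and extend the VLM's core competencies: 1) open-world embodied reasoning: the VLA should inherit the VLM's knowledge, that is, recognize anything the VLM can recognize, solve math problems, and possess visual-spatial intelligence; and 2) reasoning following: effectively translating that open-world reasoning into actionable steps for the robot.
In this work, we introduce ChatVLA-2, a novel mixture-of-experts VLA model paired with a specialized three-stage training pipeline designed to preserve the VLM's original strengths while enabling actionable reasoning.
To validate the approach, we design a math-matching task in which a robot interprets math problems written on a whiteboard and picks the corresponding number cards from a table to solve the equations.
Remarkably, our method exhibits exceptional mathematical reasoning and OCR capabilities, even though these abilities were not explicitly trained within the VLA.
Furthermore, we demonstrate that the VLA possesses strong spatial reasoning skills, enabling it to interpret novel directional instructions involving previously unseen objects.
Overall, our method shows reasoning and comprehension abilities that significantly surpass state-of-the-art imitation learning methods such as OpenVLA, DexVLA, and pi-zero.
This work represents a substantial advance toward truly generalizable robotic foundation models endowed with robust reasoning capabilities.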

Abstract (original)

Vision-language-action (VLA) models have emerged as the next generation of models in robotics. However, despite leveraging powerful pre-trained Vision-Language Models (VLMs), existing end-to-end VLA systems often lose key capabilities during fine-tuning as the model adapts to specific robotic tasks. We argue that a generalizable VLA model should retain and expand upon the VLM’s core competencies: 1) Open-world embodied reasoning – the VLA should inherit the knowledge from VLM, i.e., recognize anything that the VLM can recognize, capable of solving math problems, possessing visual-spatial intelligence, 2) Reasoning following – effectively translating the open-world reasoning into actionable steps for the robot. In this work, we introduce ChatVLA-2, a novel mixture-of-expert VLA model coupled with a specialized three-stage training pipeline designed to preserve the VLM’s original strengths while enabling actionable reasoning. To validate our approach, we design a math-matching task wherein a robot interprets math problems written on a whiteboard and picks corresponding number cards from a table to solve equations. Remarkably, our method exhibits exceptional mathematical reasoning and OCR capabilities, despite these abilities not being explicitly trained within the VLA. Furthermore, we demonstrate that the VLA possesses strong spatial reasoning skills, enabling it to interpret novel directional instructions involving previously unseen objects. Overall, our method showcases reasoning and comprehension abilities that significantly surpass state-of-the-art imitation learning methods such as OpenVLA, DexVLA, and pi-zero. This work represents a substantial advancement toward developing truly generalizable robotic foundation models endowed with robust reasoning capacities.

arXiv information

Authors	Zhongyi Zhou, Yichen Zhu, Junjie Wen, Chaomin Shen, Yi Xu
Published	2025-05-28 02:48:42+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
