CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

Abstract

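Advances in large Vision-Language-Action (VLA) models have significantly improved robotic manipulation, both in language-guided task execution and in generalization to unseen scenarios.
Existing VLAs adapted from pretrained large Vision-Language Models (VLMs) have shown promising generalizability, but their task performance is still unsatisfactory, as indicated by low task success rates across different environments.
This paper presents a new, advanced VLA architecture derived from a VLM.
Unlike previous work that directly repurposes the VLM for action prediction through simple action quantization, we propose a componentized VLA architecture with a specialized action module conditioned on the VLM output.
We systematically study the design of the action module and demonstrate strong performance gains from diffusion action transformers for action sequence modeling, as well as their favorable scaling behavior.
We also conduct comprehensive experiments and ablation studies to evaluate the efficacy of our models under varied designs.
Evaluation on 5 robot embodiments, in simulation and in the real world, shows that our model not only significantly surpasses existing VLAs in task performance but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds.
It exceeds the average success rate of OpenVLA, which has a model size (7B) similar to ours, by over 35% in simulated evaluation and 55% in real robot experiments.
It also outperforms the much larger RT-2-X model (55B) by 18% in absolute success rate in simulation.
Code and models can be found on the project page (https://cogact.github.io/).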

Abstract (original)

The advancement of large Vision-Language-Action (VLA) models has significantly improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. While existing VLAs adapted from pretrained large Vision-Language-Models (VLM) have demonstrated promising generalizability, their task performance is still unsatisfactory as indicated by the low task success rates in different environments. In this paper, we present a new advanced VLA architecture derived from VLM. Unlike previous works that directly repurpose VLM for action prediction by simple action quantization, we propose a componentized VLA architecture that has a specialized action module conditioned on VLM output. We systematically study the design of the action module and demonstrate the strong performance enhancement with diffusion action transformers for action sequence modeling, as well as their favorable scaling behaviors. We also conduct comprehensive experiments and ablation studies to evaluate the efficacy of our models with varied designs. The evaluation on 5 robot embodiments in simulation and real world shows that our model not only significantly surpasses existing VLAs in task performance but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds. It exceeds the average success rates of OpenVLA, which has a similar model size (7B) to ours, by over 35% in simulated evaluation and 55% in real robot experiments. It also outperforms the large RT-2-X model (55B) by 18% absolute success rates in simulation. Code and models can be found on our project page (https://cogact.github.io/).
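
The "simple action quantization" that the abstract contrasts against can be illustrated with a small sketch. This is not code from the paper or from any VLA codebase: the bin count (256), the action range [-1, 1], and all function names are assumptions chosen for illustration. It only shows the general idea of turning each continuous action dimension into a discrete token and back.

```python
# Illustrative sketch only: uniform action quantization.
# This is NOT CogACT's method; it sketches the baseline idea the abstract mentions.
# NUM_BINS and the [-1, 1] action range are assumptions for this example.
from typing import List

NUM_BINS = 256
LOW, HIGH = -1.0, 1.0


def quantize(value: float) -> int:
    """Map a continuous action value to a discrete bin index (a "token")."""
    clipped = min(max(value, LOW), HIGH)
    fraction = (clipped - LOW) / (HIGH - LOW)
    # The top edge (fraction == 1.0) would land in bin NUM_BINS, so cap it.
    return min(int(fraction * NUM_BINS), NUM_BINS - 1)


def dequantize(index: int) -> float:
    """Map a bin index back to the center of its bin."""
    width = (HIGH - LOW) / NUM_BINS
    return LOW + (index + 0.5) * width


def roundtrip(action: List[float]) -> List[float]:
    """Quantize then dequantize every dimension of an action vector."""
    return [dequantize(quantize(a)) for a in action]


if __name__ == "__main__":
    action = [0.1234, -0.5, 0.9999, 0.0]  # hypothetical 4-dimensional action
    recovered = roundtrip(action)
    max_error = max(abs(a - r) for a, r in zip(action, recovered))
    print(recovered)
    print(max_error)
```

Under these assumptions, the round trip loses up to half a bin width (1/256 here) per action dimension, which is the kind of discretization that a separate continuous action module would not need.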

arXiv information

Authors	Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, Xiaofan Wang, Bei Liu, Jianlong Fu, Jianmin Bao, Dong Chen, Yuanchun Shi, Jiaolong Yang, Baining Guo
Published	2024-11-29 12:06:03+00:00

Source, services used

arxiv.jp, Google
