BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models

Abstract

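Recently, leveraging pre-trained vision-language models (VLMs) to build vision-language-action (VLA) models has emerged as a promising approach to effective robot manipulation learning.
However, few methods incorporate 3D signals into VLMs for action prediction, and those that do fail to fully exploit the spatial structure inherent in 3D data, resulting in low sample efficiency.
In this paper, we introduce BridgeVLA, a novel 3D VLA model that (1) projects 3D inputs onto multiple 2D images, ensuring input alignment with the VLM backbone, and (2) uses 2D heatmaps for action prediction, unifying the input and output spaces within a consistent 2D image space.
In addition, we propose a scalable pre-training method that equips the VLM backbone with the ability to predict 2D heatmaps before downstream policy learning.
Extensive experiments show that the proposed method learns 3D manipulation efficiently and effectively.
BridgeVLA outperforms state-of-the-art baseline methods across three simulation benchmarks.
On RLBench, it improves the average success rate from 81.4% to 88.2%.
On COLOSSEUM, it performs significantly better in challenging generalization settings, raising the average success rate from 56.7% to 64.0%.
On GemBench, it surpasses all compared baseline methods in average success rate.
In real-robot experiments, BridgeVLA outperforms a state-of-the-art baseline method by 32% on average.
It generalizes robustly across multiple out-of-distribution settings, including visual disturbances and unseen instructions.
Remarkably, it achieves a 96.8% success rate on more than 10 tasks with only 3 trajectories per task, highlighting its extraordinary sample efficiency.
Project website: https://bridgevla.github.io/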
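
The abstract does not say how the 2D views are rendered or how the predicted heatmaps are turned into an action, so the following is only a minimal sketch of the general idea of keeping input and output in a shared 2D image space. It assumes three axis-aligned orthographic views, a unit-cube workspace, and a simple "sum the heatmaps over a voxel grid and take the argmax" decoding. All function names, the resolution `RES`, and the Gaussian heatmaps are hypothetical and are not taken from BridgeVLA.

```python
import numpy as np

RES = 32  # hypothetical image / voxel-grid resolution


def point_to_pixels(p, lo, hi, res=RES):
    """Orthographically project a 3D point into top (x, y), front (x, z) and side (y, z) views."""
    i, j, k = np.clip(((np.asarray(p) - lo) / (hi - lo) * res).astype(int), 0, res - 1)
    return {"top": (i, j), "front": (i, k), "side": (j, k)}


def gaussian_heatmap(center, res=RES, sigma=1.5):
    """2D Gaussian bump centred on a pixel, normalised to sum to 1 (stand-in for a predicted heatmap)."""
    u, v = np.meshgrid(np.arange(res), np.arange(res), indexing="ij")
    h = np.exp(-((u - center[0]) ** 2 + (v - center[1]) ** 2) / (2 * sigma ** 2))
    return h / h.sum()


def heatmaps_to_point(hm, lo, hi, res=RES):
    """Score every voxel by summing the three heatmaps at its projections; return the best voxel centre."""
    score = hm["top"][:, :, None] + hm["front"][:, None, :] + hm["side"][None, :, :]
    ijk = np.array(np.unravel_index(np.argmax(score), score.shape))
    return lo + (ijk + 0.5) / res * (hi - lo)


# Round trip: 3D target -> per-view pixels -> per-view heatmaps -> recovered 3D position.
lo, hi = np.zeros(3), np.ones(3)  # assumed workspace bounds
target = np.array([0.30, 0.62, 0.81])
pixels = point_to_pixels(target, lo, hi)
heatmaps = {name: gaussian_heatmap(px) for name, px in pixels.items()}
recovered = heatmaps_to_point(heatmaps, lo, hi)
```

In this sketch the recovered position lies within one voxel of the target along each axis, so the decoding error is bounded by the grid resolution. A real system would replace `gaussian_heatmap` with the model's predicted heatmaps.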

Abstract (original)

Recently, leveraging pre-trained vision-language models (VLMs) for building vision-language-action (VLA) models has emerged as a promising approach to effective robot manipulation learning. However, only few methods incorporate 3D signals into VLMs for action prediction, and they do not fully leverage the spatial structure inherent in 3D data, leading to low sample efficiency. In this paper, we introduce BridgeVLA, a novel 3D VLA model that (1) projects 3D inputs to multiple 2D images, ensuring input alignment with the VLM backbone, and (2) utilizes 2D heatmaps for action prediction, unifying the input and output spaces within a consistent 2D image space. In addition, we propose a scalable pre-training method that equips the VLM backbone with the capability to predict 2D heatmaps before downstream policy learning. Extensive experiments show the proposed method is able to learn 3D manipulation efficiently and effectively. BridgeVLA outperforms state-of-the-art baseline methods across three simulation benchmarks. In RLBench, it improves the average success rate from 81.4% to 88.2%. In COLOSSEUM, it demonstrates significantly better performance in challenging generalization settings, boosting the average success rate from 56.7% to 64.0%. In GemBench, it surpasses all the comparing baseline methods in terms of average success rate. In real-robot experiments, BridgeVLA outperforms a state-of-the-art baseline method by 32% on average. It generalizes robustly in multiple out-of-distribution settings, including visual disturbances and unseen instructions. Remarkably, it is able to achieve a success rate of 96.8% on 10+ tasks with only 3 trajectories per task, highlighting its extraordinary sample efficiency. Project Website:https://bridgevla.github.io/

arXiv information

Authors	Peiyan Li, Yixiang Chen, Hongtao Wu, Xiao Ma, Xiangnan Wu, Yan Huang, Liang Wang, Tao Kong, Tieniu Tan
Published	2025-06-09 17:36:34+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
