PointVLA: Injecting the 3D World into Vision-Language-Action Models

Summary

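Vision-Language-Action (VLA) models excel at robotic tasks by leveraging large-scale 2D vision-language pretraining, but their reliance on RGB images limits the spatial reasoning that real-world interaction requires.
Retraining these models with 3D data is computationally prohibitive, while discarding existing 2D datasets wastes valuable resources.
To bridge this gap, the authors propose PointVLA, a framework that enhances pre-trained VLAs with point cloud inputs without requiring retraining.
The method freezes the vanilla action expert and injects 3D features through a lightweight modular block.
To find the most effective way to integrate point cloud representations, the authors run a skip-block analysis that identifies the less useful blocks in the vanilla action expert, then inject 3D features only into those blocks so that disruption to the pre-trained representations is minimized.
Extensive experiments show that PointVLA outperforms state-of-the-art 2D imitation learning methods such as OpenVLA, Diffusion Policy, and DexVLA on both simulated and real-world robotic tasks.
Specifically, the paper highlights several key advantages that point cloud integration gives PointVLA: (1) Few-shot multi-tasking, where PointVLA successfully performs four different tasks using only 20 demonstrations each.
(2) Real-vs-photo discrimination, where PointVLA distinguishes real objects from images of them, using 3D world knowledge to improve safety and reliability.
(3) Height adaptability, where, unlike conventional 2D imitation learning methods, PointVLA lets the robot adapt to objects at table heights not seen in the training data.
In addition, PointVLA achieves strong performance on long-horizon tasks, such as picking and packing objects from a moving conveyor belt, demonstrating its ability to generalize across complex, dynamic environments.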
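
The two mechanisms named in the summary, skip-block analysis and injection of 3D features into only the less useful blocks of a frozen action expert, can be illustrated with a toy sketch. This is not the authors' implementation. The residual "blocks" here are plain scaling functions, the usefulness score is the change in output when a block is skipped (the paper's actual criterion is not stated in this summary), and the additive `gate` that starts at zero is an assumed design. All names (`run_blocks`, `skip_block_scores`, `run_with_injection`, `gate`) are hypothetical.

```python
from typing import Callable, FrozenSet, List, Sequence, Set

Vector = List[float]
Block = Callable[[Vector], Vector]


def make_scale_block(weight: float) -> Block:
    """Toy stand-in for a frozen block: returns a residual update weight * x."""
    return lambda x: [weight * v for v in x]


def run_blocks(blocks: Sequence[Block], x: Vector,
               skip: FrozenSet[int] = frozenset()) -> Vector:
    """Apply residual blocks in order; a skipped block acts as the identity."""
    for i, block in enumerate(blocks):
        if i in skip:
            continue
        x = [a + d for a, d in zip(x, block(x))]
    return x


def skip_block_scores(blocks: Sequence[Block], x: Vector) -> List[float]:
    """Score each block by the squared output change when it alone is skipped."""
    reference = run_blocks(blocks, x)
    scores = []
    for i in range(len(blocks)):
        out = run_blocks(blocks, x, skip=frozenset({i}))
        scores.append(sum((a - b) ** 2 for a, b in zip(out, reference)))
    return scores


def least_useful_blocks(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k blocks whose removal changes the output least."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:k])


def run_with_injection(blocks: Sequence[Block], x: Vector, point_feat: Vector,
                       inject_at: Set[int], gate: float) -> Vector:
    """Run the frozen blocks, adding gate * point_feat after selected blocks."""
    for i, block in enumerate(blocks):
        x = [a + d for a, d in zip(x, block(x))]
        if i in inject_at:
            x = [a + gate * p for a, p in zip(x, point_feat)]
    return x


# Toy data: blocks 1 and 3 contribute almost nothing to the output.
blocks = [make_scale_block(w) for w in (0.5, 0.01, 0.3, 0.0)]
x = [1.0, -2.0, 0.5]
point_feat = [0.2, 0.1, -0.3]  # stand-in for encoded point cloud features

targets = least_useful_blocks(skip_block_scores(blocks, x), k=2)
baseline = run_blocks(blocks, x)
unchanged = run_with_injection(blocks, x, point_feat, set(targets), gate=0.0)
injected = run_with_injection(blocks, x, point_feat, set(targets), gate=1.0)
```

With the gate at zero, the output matches the frozen model exactly, which is one simple way to guarantee that adding the 3D pathway cannot degrade the pre-trained behaviour before the new module has been trained.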

Abstract (original)

Vision-Language-Action (VLA) models excel at robotic tasks by leveraging large-scale 2D vision-language pretraining, but their reliance on RGB images limits spatial reasoning critical for real-world interaction. Retraining these models with 3D data is computationally prohibitive, while discarding existing 2D datasets wastes valuable resources. To bridge this gap, we propose PointVLA, a framework that enhances pre-trained VLAs with point cloud inputs without requiring retraining. Our method freezes the vanilla action expert and injects 3D features via a lightweight modular block. To identify the most effective way of integrating point cloud representations, we conduct a skip-block analysis to pinpoint less useful blocks in the vanilla action expert, ensuring that 3D features are injected only into these blocks, minimizing disruption to pre-trained representations. Extensive experiments demonstrate that PointVLA outperforms state-of-the-art 2D imitation learning methods, such as OpenVLA, Diffusion Policy and DexVLA, across both simulated and real-world robotic tasks. Specifically, we highlight several key advantages of PointVLA enabled by point cloud integration: (1) Few-shot multi-tasking, where PointVLA successfully performs four different tasks using only 20 demonstrations each; (2) Real-vs-photo discrimination, where PointVLA distinguishes real objects from their images, leveraging 3D world knowledge to improve safety and reliability; (3) Height adaptability, where, unlike conventional 2D imitation learning methods, PointVLA enables robots to adapt to objects at varying table heights that are unseen in the training data. Furthermore, PointVLA achieves strong performance in long-horizon tasks, such as picking and packing objects from a moving conveyor belt, showcasing its ability to generalize across complex, dynamic environments.

arXiv information

Authors	Chengmeng Li, Junjie Wen, Yan Peng, Yaxin Peng, Feifei Feng, Yichen Zhu
Published	2025-03-10 16:32:41+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
