NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

Summary

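Existing vision-language-action (VLA) models have shown promising performance in zero-shot scenarios, demonstrating impressive task execution and reasoning capabilities.
However, the limitations of visual encoding pose a significant challenge and can cause failures in tasks such as object grasping.
Moreover, these models are typically large, often exceeding 7B parameters, and therefore suffer from high computational overhead.
Although these models excel at reasoning and task planning, the substantial computational overhead they incur makes them impractical for real-time robotic environments, where speed and efficiency are paramount.
To address these limitations of existing VLA models, we propose NORA, a 3B-parameter model designed to reduce computational overhead while maintaining strong task performance.
NORA adopts the Qwen-2.5-VL-3B multimodal model as its backbone, leveraging its superior visual-semantic understanding to enhance visual reasoning and action grounding.
In addition, NORA is trained on 970k real-world robot demonstrations and equipped with the FAST+ tokenizer for efficient action sequence generation.
Experimental results show that NORA outperforms existing large-scale VLA models, achieving better task performance with significantly reduced computational overhead, which makes it a more practical solution for real-time robotic autonomy.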

Summary (original)

Existing Visual-Language-Action (VLA) models have shown promising performance in zero-shot scenarios, demonstrating impressive task execution and reasoning capabilities. However, a significant challenge arises from the limitations of visual encoding, which can result in failures during tasks such as object grasping. Moreover, these models typically suffer from high computational overhead due to their large sizes, often exceeding 7B parameters. While these models excel in reasoning and task planning, the substantial computational overhead they incur makes them impractical for real-time robotic environments, where speed and efficiency are paramount. To address the limitations of existing VLA models, we propose NORA, a 3B-parameter model designed to reduce computational overhead while maintaining strong task performance. NORA adopts the Qwen-2.5-VL-3B multimodal model as its backbone, leveraging its superior visual-semantic understanding to enhance visual reasoning and action grounding. Additionally, NORA is trained on 970k real-world robot demonstrations and equipped with the FAST+ tokenizer for efficient action sequence generation. Experimental results demonstrate that NORA outperforms existing large-scale VLA models, achieving better task performance with significantly reduced computational overhead, making it a more practical solution for real-time robotic autonomy.
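
To give a rough sense of the size gap the abstract describes, the sketch below estimates weight storage for a 3B-parameter and a 7B-parameter model. It is a back-of-the-envelope illustration, not a measurement from the paper: the round parameter counts (3e9 and 7e9), the bytes-per-parameter table, and the helper name `weight_memory_gb` are all assumptions made here, and the estimate covers weights only, ignoring activations, the KV cache, and runtime overhead.

```python
# Rough weight-memory estimate (assumption: weights only, no activations or KV cache).
# Bytes per parameter for common storage precisions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "bf16") -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Nominal round-number sizes, not exact parameter counts of any real model.
    for name, params in [("3B model", 3e9), ("7B model", 7e9)]:
        print(f"{name}: ~{weight_memory_gb(params):.0f} GB of weights in bf16")
```

Under these assumptions the smaller model needs less than half the weight memory of the larger one at the same precision, which is one component of the computational overhead the abstract refers to.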

arXiv information

Authors	Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U-Xuan Tan, Navonil Majumder, Soujanya Poria
Published	2025-04-28 14:47:34+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
