Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks

Summary

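Vision-language-action (VLA) models are a promising direction for building general-purpose robotic systems, and have demonstrated the ability to combine visual understanding, language comprehension, and action generation.
However, systematic evaluation of these models across diverse robotic tasks remains limited.
In this work, we present a comprehensive evaluation framework and benchmark suite for assessing VLA models.
We profile three state-of-the-art VLM and VLA models (GPT-4o, OpenVLA, and JAT) across 20 diverse datasets from the Open-X-Embodiment collection and evaluate their performance on a range of manipulation tasks.
Our analysis yields three key insights: (1) current VLA models vary significantly in performance across tasks and robot platforms, with GPT-4o showing the most consistent performance through sophisticated prompt engineering; (2) all models struggle with complex manipulation tasks that require multi-step planning; and (3) model performance is notably sensitive to action space characteristics and environmental factors.
We release our evaluation framework and findings to facilitate systematic assessment of future VLA models and to identify critical areas for improvement in the development of general-purpose robotic systems.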

Summary (original)

Vision-language-action (VLA) models represent a promising direction for developing general-purpose robotic systems, demonstrating the ability to combine visual understanding, language comprehension, and action generation. However, systematic evaluation of these models across diverse robotic tasks remains limited. In this work, we present a comprehensive evaluation framework and benchmark suite for assessing VLA models. We profile three state-of-the-art VLM and VLAs – GPT-4o, OpenVLA, and JAT – across 20 diverse datasets from the Open-X-Embodiment collection, evaluating their performance on various manipulation tasks. Our analysis reveals several key insights: 1. current VLA models show significant variation in performance across different tasks and robot platforms, with GPT-4o demonstrating the most consistent performance through sophisticated prompt engineering, 2. all models struggle with complex manipulation tasks requiring multi-step planning, and 3. model performance is notably sensitive to action space characteristics and environmental factors. We release our evaluation framework and findings to facilitate systematic assessment of future VLA models and identify critical areas for improvement in the development of general purpose robotic systems.

arXiv information

Authors	Pranav Guruprasad, Harshvardhan Sikka, Jaewoo Song, Yangyue Wang, Paul Pu Liang
Published	2024-12-08 06:54:43+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
