Embodied Scene Understanding for Vision Language Models via MetaVQA

Abstract

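Vision language models (VLMs) show great potential as embodied AI agents for a range of mobility applications.
However, there is no standardized closed-loop benchmark for evaluating their spatial reasoning and sequential decision-making capabilities.
To address this, we introduce MetaVQA, a comprehensive benchmark designed to assess and improve VLMs' understanding of spatial relationships and scene dynamics through Visual Question Answering (VQA) and closed-loop simulation.
MetaVQA uses Set-of-Mark prompting and top-down-view ground-truth annotations from the nuScenes and Waymo datasets to automatically generate extensive question-answer pairs grounded in diverse real-world traffic scenarios, ensuring object-centric, context-rich instructions.
Our experiments show that fine-tuning VLMs on the MetaVQA dataset significantly improves their spatial reasoning and embodied scene understanding in safety-critical simulations, as seen both in higher VQA accuracy and in emergent safety-aware driving maneuvers.
In addition, the learned capabilities transfer strongly from simulation to real-world observations.
Code and data will be made publicly available at https://metadriverse.github.io/metavqa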

Abstract (original)

Vision Language Models (VLMs) demonstrate significant potential as embodied AI agents for various mobility applications. However, a standardized, closed-loop benchmark for evaluating their spatial reasoning and sequential decision-making capabilities is lacking. To address this, we present MetaVQA: a comprehensive benchmark designed to assess and enhance VLMs’ understanding of spatial relationships and scene dynamics through Visual Question Answering (VQA) and closed-loop simulations. MetaVQA leverages Set-of-Mark prompting and top-down view ground-truth annotations from nuScenes and Waymo datasets to automatically generate extensive question-answer pairs based on diverse real-world traffic scenarios, ensuring object-centric and context-rich instructions. Our experiments show that fine-tuning VLMs with the MetaVQA dataset significantly improves their spatial reasoning and embodied scene comprehension in safety-critical simulations, evident not only in improved VQA accuracies but also in emerging safety-aware driving maneuvers. In addition, the learning demonstrates strong transferability from simulation to real-world observation. Code and data will be publicly available at https://metadriverse.github.io/metavqa .
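The abstract says question-answer pairs are generated automatically from top-down ground-truth annotations, with objects referenced by Set-of-Mark labels. The following is a minimal sketch of that general idea, not the MetaVQA pipeline itself: every name here (MarkedObject, to_ego_frame, sector, make_qa), the four-way front/left/back/right split, and the question wording are assumptions made for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class MarkedObject:
    """Hypothetical annotation record; not a MetaVQA data structure."""
    mark: int    # numeric label drawn on the image, Set-of-Mark style
    kind: str    # object class, e.g. "car"
    x: float     # world-frame position in metres (top-down view)
    y: float


def to_ego_frame(obj, ego_x, ego_y, ego_heading):
    """Return (forward, left) offsets of obj relative to the ego vehicle.

    ego_heading is in radians, measured counter-clockwise from the +x axis.
    """
    dx, dy = obj.x - ego_x, obj.y - ego_y
    cos_h, sin_h = math.cos(ego_heading), math.sin(ego_heading)
    forward = dx * cos_h + dy * sin_h
    left = -dx * sin_h + dy * cos_h
    return forward, left


def sector(forward, left):
    """Bucket a relative position into one of four 90-degree sectors."""
    angle = math.degrees(math.atan2(left, forward))  # 0 = ahead, positive = left
    if -45 <= angle <= 45:
        return "front"
    if 45 < angle < 135:
        return "left"
    if -135 < angle < -45:
        return "right"
    return "back"


def make_qa(obj, ego_x, ego_y, ego_heading):
    """Build one multiple-choice question from ground-truth geometry."""
    forward, left = to_ego_frame(obj, ego_x, ego_y, ego_heading)
    options = ["front", "left", "back", "right"]
    letters = "ABCD"
    choices = " ".join(f"({l}) {o}" for l, o in zip(letters, options))
    question = (
        f"Where is object <{obj.mark}> ({obj.kind}) "
        f"relative to the ego vehicle? {choices}"
    )
    answer = letters[options.index(sector(forward, left))]
    return {
        "question": question,
        "answer": answer,
        "distance_m": round(math.hypot(forward, left), 1),
    }


if __name__ == "__main__":
    car = MarkedObject(mark=3, kind="car", x=10.0, y=0.0)
    print(make_qa(car, ego_x=0.0, ego_y=0.0, ego_heading=0.0))
```

Because the answer is computed from annotated geometry rather than written by hand, a generator of this kind can be run over every labelled object in every scenario, which is what makes large question sets feasible.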

arXiv information

Authors	Weizhen Wang, Chenda Duan, Zhenghao Peng, Yuxin Liu, Bolei Zhou
Published	2025-01-15 21:36:19+00:00

Source and services used

arxiv.jp, Google
