ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark

Abstract

The enhancement of generalization in robots by large vision-language models (LVLMs) is increasingly evident. Therefore, the embodied cognitive abilities of LVLMs based on egocentric videos are of great interest. However, current datasets for embodied video question answering lack comprehensive and systematic evaluation frameworks. Critical embodied cognitive issues, such as robotic self-cognition, dynamic scene perception, and hallucination, are rarely addressed. To tackle these challenges, we propose ECBench, a high-quality benchmark designed to systematically evaluate the embodied cognitive abilities of LVLMs. ECBench features a diverse range of scene video sources, open and varied question formats, and 30 dimensions of embodied cognition. To ensure quality, balance, and high visual dependence, ECBench uses class-independent meticulous human annotation and multi-round question screening strategies. Additionally, we introduce ECEval, a comprehensive evaluation system that ensures the fairness and rationality of the indicators. Utilizing ECBench, we conduct extensive evaluations of proprietary, open-source, and task-specific LVLMs. ECBench is pivotal in advancing the embodied cognitive capabilities of LVLMs, laying a solid foundation for developing reliable core models for embodied agents. All data and code are available at https://github.com/Rh-Dang/ECBench.

arXiv information

Authors	Ronghao Dang, Yuqian Yuan, Wenqi Zhang, Yifei Xin, Boqiang Zhang, Long Li, Liuyi Wang, Qinyang Zeng, Xin Li, Lidong Bing
Published	2025-03-13 07:45:55+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
