Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

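Summary

Multimodal large language models (MLLMs) have demonstrated proficiency in handling a variety of vision-language tasks.
However, current MLLM benchmarks are designed mainly to evaluate reasoning over static information in a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding an ever-changing world, has received less study.
To address this gap, this paper introduces Mementos, a new benchmark designed to assess the sequential image reasoning abilities of MLLMs.
Mementos contains 4,761 diverse image sequences of varying lengths.
We also employ a GPT-4-assisted method to evaluate MLLM reasoning performance.
A careful evaluation of nine recent MLLMs on Mementos, including GPT-4V and Gemini, shows that they struggle to accurately describe dynamic information about the given image sequences, often hallucinating or misrepresenting objects and their corresponding behaviors.
Our quantitative analysis and case studies identify three key factors that affect MLLMs' sequential image reasoning: the correlation between object and behavioral hallucinations, the influence of co-occurring behaviors, and the compounding impact of behavioral hallucinations.
Our dataset is available at https://github.com/umd-huang-lab/Mementos.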

Abstract (original)

Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a variety of visual-language tasks. However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated. To address this challenge, this paper introduces Mementos, a new benchmark designed to assess MLLMs’ sequential image reasoning abilities. Mementos features 4,761 diverse image sequences with varying lengths. We also employ a GPT-4 assisted method to evaluate MLLM reasoning performance. Through a careful evaluation of nine recent MLLMs on Mementos, including GPT-4V and Gemini, we find that they struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects and their corresponding behaviors. Our quantitative analysis and case studies identify three key factors impacting MLLMs’ sequential image reasoning: the correlation between object and behavioral hallucinations, the influence of cooccurring behaviors, and the compounding impact of behavioral hallucinations. Our dataset is available at https://github.com/umd-huang-lab/Mementos.

arXiv information

Authors	Xiyao Wang, Yuhang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong He, Jaehong Yoon, Taixi Lu, Gedas Bertasius, Mohit Bansal, Huaxiu Yao, Furong Huang
Published	2024-01-19 07:10:13+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
