Beyond Single Frames: Can LMMs Comprehend Temporal and Contextual Narratives in Image Sequences?

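Summary

Large multimodal models (LMMs) have achieved remarkable success across a wide range of vision-language tasks.
However, existing benchmarks focus mainly on single-image understanding, leaving the analysis of image sequences largely unexplored.
To address this limitation, the authors introduce StripCipher, a comprehensive benchmark designed to evaluate how well LMMs comprehend and reason over sequential images.
StripCipher comprises a human-annotated dataset and three challenging subtasks: visual narrative comprehension, contextual frame prediction, and temporal narrative reordering.
An evaluation of 16 state-of-the-art LMMs, including GPT-4o and Qwen2.5VL, reveals a significant performance gap relative to humans, particularly on tasks that require reordering shuffled sequential images.
For example, GPT-4o achieves only 23.93% accuracy on the reordering subtask, which is 56.07% lower than human performance.
Further quantitative analysis discusses several factors that affect the performance of LMMs in sequential understanding, such as the input format of the images, and underscores the fundamental challenges that remain in LMM development.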
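
The reordering figures above can be checked with a little arithmetic. The sketch below is illustrative only and rests on two assumptions that the abstract does not confirm: that reordering is scored by exact match (a prediction counts only if the whole frame order is correct), and that the 56.07% gap is meant in percentage points. The function names and the toy data are hypothetical and are not taken from the benchmark.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[Sequence[int]],
                         references: Sequence[Sequence[int]]) -> float:
    """Percentage of examples whose predicted frame order equals the reference order.

    Assumption: exact-match scoring; the abstract does not specify the metric.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one example")
    correct = sum(list(p) == list(r) for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


def implied_human_accuracy(model_accuracy: float, gap_points: float) -> float:
    """Human accuracy implied by a model score and a gap in percentage points.

    Assumption: the reported gap is absolute (percentage points), not relative.
    """
    return round(model_accuracy + gap_points, 2)


# Toy data (hypothetical): four shuffled 4-frame strips, only the first is ordered correctly.
references = [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]
predictions = [[0, 1, 2, 3], [1, 0, 2, 3], [0, 1, 3, 2], [3, 2, 1, 0]]

toy_accuracy = exact_match_accuracy(predictions, references)
human_accuracy = implied_human_accuracy(23.93, 56.07)

print(f"toy exact-match accuracy: {toy_accuracy:.2f}%")
print(f"implied human accuracy:   {human_accuracy:.2f}%")
```

Under the percentage-point reading, the two reported numbers sum to a human accuracy of 80.00% on the reordering subtask. If the gap were relative instead, the implied human score would differ, so this figure should be treated as an inference rather than a reported result.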

Abstract (original)

Large Multimodal Models (LMMs) have achieved remarkable success across various visual-language tasks. However, existing benchmarks predominantly focus on single-image understanding, leaving the analysis of image sequences largely unexplored. To address this limitation, we introduce StripCipher, a comprehensive benchmark designed to evaluate capabilities of LMMs to comprehend and reason over sequential images. StripCipher comprises a human-annotated dataset and three challenging subtasks: visual narrative comprehension, contextual frame prediction, and temporal narrative reordering. Our evaluation of 16 state-of-the-art LMMs, including GPT-4o and Qwen2.5VL, reveals a significant performance gap compared to human capabilities, particularly in tasks that require reordering shuffled sequential images. For instance, GPT-4o achieves only 23.93% accuracy in the reordering subtask, which is 56.07% lower than human performance. Further quantitative analysis discusses several factors, such as input format of images, affecting the performance of LMMs in sequential understanding, underscoring the fundamental challenges that remain in the development of LMMs.

arXiv information

Authors	Xiaochen Wang, Heming Xia, Jialin Song, Longyu Guan, Yixin Yang, Qingxiu Dong, Weiyao Luo, Yifan Pu, Yiru Wang, Xiangdi Meng, Wenjie Li, Zhifang Sui
Published	2025-02-19 18:04:44+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
