Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Infer Causal Links Between Siamese Images

Abstract

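Large Language Models (LLMs) have shown exceptional ability in causal reasoning from textual information.
But do these causal relationships remain straightforward for Vision Large Language Models (VLLMs) when only visual hints are provided?
Motivated by this question, we propose MuCR, a new Multimodal Causal Reasoning benchmark that challenges VLLMs to infer semantic cause-and-effect relationships while relying solely on visual cues such as action, appearance, clothing, and environment.
Specifically, we introduce a prompt-driven image synthesis approach that creates siamese images with embedded semantic causality and visual cues, which makes it possible to evaluate the causal reasoning capabilities of VLLMs effectively.
In addition, we develop tailored metrics from multiple perspectives, including image-level match, phrase-level understanding, and sentence-level explanation, to assess the comprehension abilities of VLLMs comprehensively.
Our extensive experiments reveal that current state-of-the-art VLLMs are not as skilled at multimodal causal reasoning as we had hoped.
We also perform a comprehensive analysis to understand the shortcomings of these models from different viewpoints and to suggest directions for future research.
We hope that MuCR can serve as a valuable resource and a foundational benchmark for multimodal causal reasoning research.
The project is available at https://github.com/Zhiyuan-Li-John/MuCR.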

Abstract (original)

Large Language Models (LLMs) have showcased exceptional ability in causal reasoning from textual information. However, will these causalities remain straightforward for Vision Large Language Models (VLLMs) when only visual hints are provided? Motivated by this, we propose a novel Multimodal Causal Reasoning benchmark, namely MuCR, to challenge VLLMs to infer semantic cause-and-effect relationship when solely relying on visual cues such as action, appearance, clothing, and environment. Specifically, we introduce a prompt-driven image synthesis approach to create siamese images with embedded semantic causality and visual cues, which can effectively evaluate VLLMs’ causal reasoning capabilities. Additionally, we develop tailored metrics from multiple perspectives, including image-level match, phrase-level understanding, and sentence-level explanation, to comprehensively assess VLLMs’ comprehension abilities. Our extensive experiments reveal that the current state-of-the-art VLLMs are not as skilled at multimodal causal reasoning as we might have hoped. Furthermore, we perform a comprehensive analysis to understand these models’ shortcomings from different views and suggest directions for future research. We hope MuCR can serve as a valuable resource and foundational benchmark in multimodal causal reasoning research. The project is available at: https://github.com/Zhiyuan-Li-John/MuCR

arXiv information

Authors	Zhiyuan Li, Heng Wang, Dongnan Liu, Chaoyi Zhang, Ao Ma, Jieting Long, Weidong Cai
Published	2024-08-15 12:04:32+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
