SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models

Summary

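The growing use of multimodal large language models (MLLMs) across many sectors has put a spotlight on the reliability and accuracy of their output, particularly their ability to produce content grounded in factual information such as common and domain-specific knowledge.
In this work, we introduce SimpleVQA, the first comprehensive multimodal benchmark for evaluating the factuality of MLLMs when answering short natural-language questions.
SimpleVQA is characterized by six key features: it covers multiple tasks and multiple scenarios, ensures high-quality and challenging queries, maintains static and timeless reference answers, and is straightforward to evaluate.
Our approach categorizes visual question-answering items into 9 different tasks centered on objective events or common knowledge and situates them within 9 topics.
Rigorous quality control processes are implemented to guarantee high-quality, concise, and clear answers, which facilitates low-variance evaluation via an LLM-as-a-judge scoring system.
Using SimpleVQA, we perform a comprehensive assessment of 18 leading MLLMs and 8 text-only LLMs, examining their image comprehension and text generation abilities by identifying and analyzing error cases.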

Summary (original)

The increasing application of multi-modal large language models (MLLMs) across various sectors have spotlighted the essence of their output reliability and accuracy, particularly their ability to produce content grounded in factual information (e.g. common and domain-specific knowledge). In this work, we introduce SimpleVQA, the first comprehensive multi-modal benchmark to evaluate the factuality ability of MLLMs to answer natural language short questions. SimpleVQA is characterized by six key features: it covers multiple tasks and multiple scenarios, ensures high quality and challenging queries, maintains static and timeless reference answers, and is straightforward to evaluate. Our approach involves categorizing visual question-answering items into 9 different tasks around objective events or common knowledge and situating these within 9 topics. Rigorous quality control processes are implemented to guarantee high-quality, concise, and clear answers, facilitating evaluation with minimal variance via an LLM-as-a-judge scoring system. Using SimpleVQA, we perform a comprehensive assessment of leading 18 MLLMs and 8 text-only LLMs, delving into their image comprehension and text generation abilities by identifying and analyzing error cases.

arXiv information

Authors	Xianfu Cheng, Wei Zhang, Shiwei Zhang, Jian Yang, Xiangyuan Guan, Xianjie Wu, Xiang Li, Ge Zhang, Jiaheng Liu, Yuying Mai, Yutao Zeng, Zhoufutu Wen, Ke Jin, Baorui Wang, Weixiao Zhou, Yunhong Lu, Tongliang Li, Wenhao Huang, Zhoujun Li
Published	2025-02-18 17:04:26+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
