WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts

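Summary

Documents are fundamental to preserving and disseminating information, and they often incorporate complex layouts, tables, and charts that pose significant challenges for automatic document understanding (DU).
Vision-language large models (VLLMs) have shown improvements across a variety of tasks, but their effectiveness in processing long-context visual inputs remains unclear.
This paper introduces WikiMixQA, a benchmark of 1,000 multiple-choice questions (MCQs) designed to evaluate cross-modal reasoning over tables and charts extracted from 4,000 Wikipedia pages spanning seven distinct topics.
Unlike existing benchmarks, WikiMixQA emphasizes complex reasoning by requiring models to synthesize information from multiple modalities.
An evaluation of 12 state-of-the-art vision-language models shows that proprietary models reach roughly 70% accuracy when given the relevant context directly, but their performance degrades significantly when the context must be retrieved from long documents.
Among them, GPT-4-o is the only model that exceeds 50% accuracy in this setting, while open-source models perform considerably worse, with a maximum accuracy of 27%.
These findings highlight the challenges of long-context, multimodal reasoning and establish WikiMixQA as an important benchmark for advancing document understanding research.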

Abstract (original)

Documents are fundamental to preserving and disseminating information, often incorporating complex layouts, tables, and charts that pose significant challenges for automatic document understanding (DU). While vision-language large models (VLLMs) have demonstrated improvements across various tasks, their effectiveness in processing long-context vision inputs remains unclear. This paper introduces WikiMixQA, a benchmark comprising 1,000 multiple-choice questions (MCQs) designed to evaluate cross-modal reasoning over tables and charts extracted from 4,000 Wikipedia pages spanning seven distinct topics. Unlike existing benchmarks, WikiMixQA emphasizes complex reasoning by requiring models to synthesize information from multiple modalities. We evaluate 12 state-of-the-art vision-language models, revealing that while proprietary models achieve ~70% accuracy when provided with direct context, their performance deteriorates significantly when retrieval from long documents is required. Among these, GPT-4-o is the only model exceeding 50% accuracy in this setting, whereas open-source models perform considerably worse, with a maximum accuracy of 27%. These findings underscore the challenges of long-context, multi-modal reasoning and establish WikiMixQA as a crucial benchmark for advancing document understanding research.

arXiv information

Authors	Negar Foroutan, Angelika Romanou, Matin Ansaripour, Julian Martin Eisenschlos, Karl Aberer, Rémi Lebret
Published	2025-06-18 16:09:18+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
