M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models

Summary

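Existing benchmarks for evaluating foundation models focus mainly on single-document, text-only tasks.
However, they often fail to fully capture the complexity of research workflows, which typically involve interpreting non-textual data and gathering information across multiple documents.
To address this gap, we introduce M3SciQA, a multi-modal, multi-document scientific question answering benchmark designed for a more comprehensive evaluation of foundation models.
M3SciQA consists of 1,452 expert-annotated questions spanning 70 natural language processing paper clusters. Each cluster represents a primary paper together with all of its cited documents, mirroring the workflow of comprehending a single paper, which requires multi-modal and multi-document data.
Using M3SciQA, we comprehensively evaluate 18 foundation models.
Our results show that current foundation models still significantly underperform human experts in multi-modal information retrieval and in reasoning across multiple scientific documents.
We also explore the implications of these findings for future progress in applying foundation models to multi-modal scientific literature analysis.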

Summary (original)

Existing benchmarks for evaluating foundation models mainly focus on single-document, text-only tasks. However, they often fail to fully capture the complexity of research workflows, which typically involve interpreting non-textual data and gathering information across multiple documents. To address this gap, we introduce M3SciQA, a multi-modal, multi-document scientific question answering benchmark designed for a more comprehensive evaluation of foundation models. M3SciQA consists of 1,452 expert-annotated questions spanning 70 natural language processing paper clusters, where each cluster represents a primary paper along with all its cited documents, mirroring the workflow of comprehending a single paper by requiring multi-modal and multi-document data. With M3SciQA, we conduct a comprehensive evaluation of 18 foundation models. Our results indicate that current foundation models still significantly underperform compared to human experts in multi-modal information retrieval and in reasoning across multiple scientific documents. Additionally, we explore the implications of these findings for the future advancement of applying foundation models in multi-modal scientific literature analysis.

arXiv information

Authors	Chuhan Li, Ziyao Shangguan, Yilun Zhao, Deyuan Li, Yixin Liu, Arman Cohan
Published	2024-11-06 17:52:01+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
