VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning

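Abstract

Multi-modal large language models (MLLMs) have shown promising capabilities across a range of tasks by integrating textual and visual information to understand complex scenarios visually.
Although several benchmarks evaluate MLLMs on tasks ranging from visual question answering to complex problem solving, most focus mainly on mathematics or general visual understanding.
This exposes a significant gap in current benchmarks, which often overlook other key scientific disciplines such as physics and chemistry.
To address this gap, we carefully constructed a comprehensive benchmark named VisScience for evaluating multi-modal scientific reasoning across three disciplines: mathematics, physics, and chemistry.
The benchmark consists of 3,000 questions drawn from K12 education, from elementary school through high school, distributed evenly across the three disciplines with 1,000 questions each.
The questions in VisScience span 21 distinct subjects and are classified into five difficulty levels, covering a broad range of topics in each discipline.
Using VisScience, we present a detailed evaluation of 25 representative MLLMs on scientific reasoning.
The experimental results show that closed-source MLLMs generally outperform open-source models.
The best results observed are 53.4% accuracy in mathematics by Claude3.5-Sonnet, 38.2% in physics by GPT-4o, and 47.0% in chemistry by Gemini-1.5-Pro.
These results highlight the strengths and limitations of MLLMs, point to areas for future improvement, and underline the importance of developing models that can handle the diverse demands of multi-modal scientific reasoning.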

Abstract (original)

Multi-modal large language models (MLLMs) have demonstrated promising capabilities across various tasks by integrating textual and visual information to achieve visual understanding in complex scenarios. Despite the availability of several benchmarks that aim to evaluate MLLMs on tasks from visual question answering to complex problem-solving, most focus predominantly on mathematics or general visual understanding tasks. This reveals a critical gap in current benchmarks, which often overlook other key scientific disciplines such as physics and chemistry. To address this gap, we meticulously construct a comprehensive benchmark, named VisScience, which assesses multi-modal scientific reasoning across the three disciplines of mathematics, physics, and chemistry. This benchmark comprises 3,000 questions drawn from K12 education, spanning elementary school through high school, and equally distributed across the three disciplines, with 1,000 questions per discipline. The questions within VisScience span 21 distinct subjects and are categorized into five difficulty levels, offering a broad spectrum of topics within each discipline. With VisScience, we present a detailed evaluation of the performance of 25 representative MLLMs in scientific reasoning. Experimental results demonstrate that closed-source MLLMs generally outperform open-source models. The best performances observed include 53.4% accuracy in mathematics by Claude3.5-Sonnet, 38.2% in physics by GPT-4o, and 47.0% in chemistry by Gemini-1.5-Pro. These results underscore the strengths and limitations of MLLMs, suggesting areas for future improvement and highlighting the importance of developing models that can effectively handle the diverse demands of multi-modal scientific reasoning.
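
The per-discipline figures quoted in the abstract are accuracies, which presumably means the share of correctly answered questions within each discipline. The following is a minimal sketch of that arithmetic, not the authors' evaluation code. The record schema (the keys `discipline` and `correct`) and the function name `accuracy_by_discipline` are assumptions made for illustration, and the sample records are made-up toy data, not benchmark results.

```python
from collections import defaultdict


def accuracy_by_discipline(records):
    """Return {discipline: accuracy in percent} for graded records.

    Each record is a dict with the hypothetical keys "discipline" (str)
    and "correct" (bool). This schema is an assumption, not the
    VisScience data format.
    """
    totals = defaultdict(int)  # questions seen per discipline
    hits = defaultdict(int)    # correct answers per discipline
    for record in records:
        totals[record["discipline"]] += 1
        hits[record["discipline"]] += int(record["correct"])
    return {name: 100.0 * hits[name] / totals[name] for name in totals}


# Toy data only: four made-up graded answers, not real benchmark results.
toy_records = [
    {"discipline": "mathematics", "correct": True},
    {"discipline": "mathematics", "correct": False},
    {"discipline": "physics", "correct": True},
    {"discipline": "chemistry", "correct": False},
]
print(accuracy_by_discipline(toy_records))
```

With 1,000 questions per discipline, an accuracy of 53.4% would correspond to 534 correct answers under this definition.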

arXiv information

Authors	Zhihuan Jiang, Zhen Yang, Jinhao Chen, Zhengxiao Du, Weihan Wang, Bin Xu, Jie Tang
Published	2024-12-02 15:11:23+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
