VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning

Summary

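Previous research on large language models (LLMs) and large multi-modal models (LMMs) has systematically explored mathematical problem-solving (MPS) in visual contexts, but how these models process visual information while solving a problem has not been analyzed sufficiently.
To address this gap, the authors present VisAidMath, a benchmark for evaluating the MPS process as it relates to visual information.
They follow a rigorous data curation pipeline that combines automated processing with manual annotation to ensure data quality and reliability.
The resulting benchmark contains 1,200 challenging problems that span various mathematical branches, vision-aid formulations, and difficulty levels, collected from diverse sources such as textbooks, examination papers, and Olympiad problems.
Using this benchmark, they run comprehensive evaluations of ten mainstream LLMs and LMMs and highlight deficiencies in the visual-aided reasoning process.
For example, GPT-4V reaches only 45.33% accuracy on the visual-aided reasoning task, and its accuracy even drops by 2 points when it is given golden visual aids.
In-depth analysis shows that the main cause of these deficiencies is hallucination in the implicit visual reasoning process, which points to future research directions for visual-aided MPS.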

Summary (original)

Although previous research on large language models (LLMs) and large multi-modal models (LMMs) has systematically explored mathematical problem-solving (MPS) within visual contexts, the analysis of how these models process visual information during problem-solving remains insufficient. To address this gap, we present VisAidMath, a benchmark for evaluating the MPS process related to visual information. We follow a rigorous data curation pipeline involving both automated processes and manual annotations to ensure data quality and reliability. Consequently, this benchmark includes 1,200 challenging problems from various mathematical branches, vision-aid formulations, and difficulty levels, collected from diverse sources such as textbooks, examination papers, and Olympiad problems. Based on the proposed benchmark, we conduct comprehensive evaluations on ten mainstream LLMs and LMMs, highlighting deficiencies in the visual-aided reasoning process. For example, GPT-4V only achieves 45.33% accuracy in the visual-aided reasoning task, even with a drop of 2 points when provided with golden visual aids. In-depth analysis reveals that the main cause of deficiencies lies in hallucination regarding the implicit visual reasoning process, shedding light on future research directions in the visual-aided MPS process.

arXiv information

Authors	Jingkun Ma, Runzhe Zhan, Derek F. Wong, Yang Li, Di Sun, Hou Pong Chan, Lidia S. Chao
Published	2024-10-30 13:19:44+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
