GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving

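Summary

Recent advances in large language models (LLMs) and multi-modal models (MMs) have demonstrated remarkable problem-solving capabilities.
However, their ability to tackle geometry math problems, which require an integrated understanding of both textual and visual information, has not been thoroughly evaluated.
To address this gap, we introduce the GeoEval benchmark, a comprehensive collection comprising a main subset of 2000 problems, a 750-problem subset focusing on backward reasoning, an augmented subset of 2000 problems, and a hard subset of 300 problems.
The benchmark enables a deeper investigation of how LLMs and MMs perform on geometry math problems.
An evaluation of ten LLMs and MMs across these subsets shows that the WizardMath model performs best, reaching 55.67% accuracy on the main subset but only 6.00% accuracy on the hard subset.
This underscores the critical need to test models on datasets on which they have not been pre-trained.
In addition, our findings indicate that GPT-series models perform more effectively on problems they themselves have rephrased, suggesting a promising way to enhance model capabilities.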
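
As a rough illustration of the numbers above, the following Python sketch records the subset sizes stated in the abstract and shows how an accuracy percentage is derived from a count of correct answers. The dictionary keys, the function name, and the count of 18 correct answers are assumptions made for illustration only; they are not taken from the GeoEval paper or its code, and the total is simply the sum of the stated sizes, not a claim about unique problems.

```python
# Subset sizes as stated in the abstract; the key names are illustrative.
SUBSET_SIZES = {
    "main": 2000,
    "backward": 750,
    "augmented": 2000,
    "hard": 300,
}


def accuracy_percent(num_correct: int, num_total: int) -> float:
    """Return the share of correctly answered problems as a percentage."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    return 100.0 * num_correct / num_total


# Sum of the stated subset sizes (subsets may overlap in the real benchmark).
total_problems = sum(SUBSET_SIZES.values())

# Hypothetical count: 18 correct answers out of 300 corresponds to 6.00%.
hard_accuracy = accuracy_percent(18, SUBSET_SIZES["hard"])

print(f"Sum of subset sizes: {total_problems}")
print(f"Hard-subset accuracy for 18 correct answers: {hard_accuracy:.2f}%")
```

On a 300-problem subset each additional correct answer moves the score by about a third of a percentage point, which is worth keeping in mind when comparing models on the hard subset.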

Abstract (original)

Recent advancements in Large Language Models (LLMs) and Multi-Modal Models (MMs) have demonstrated their remarkable capabilities in problem-solving. Yet, their proficiency in tackling geometry math problems, which necessitates an integrated understanding of both textual and visual information, has not been thoroughly evaluated. To address this gap, we introduce the GeoEval benchmark, a comprehensive collection that includes a main subset of 2000 problems, a 750 problem subset focusing on backward reasoning, an augmented subset of 2000 problems, and a hard subset of 300 problems. This benchmark facilitates a deeper investigation into the performance of LLMs and MMs on solving geometry math problems. Our evaluation of ten LLMs and MMs across these varied subsets reveals that the WizardMath model excels, achieving a 55.67% accuracy rate on the main subset but only a 6.00% accuracy on the challenging subset. This highlights the critical need for testing models against datasets on which they have not been pre-trained. Additionally, our findings indicate that GPT-series models perform more effectively on problems they have rephrased, suggesting a promising method for enhancing model capabilities.

arXiv information

Authors	Jiaxin Zhang, Zhongzhi Li, Mingliang Zhang, Fei Yin, Chenglin Liu, Yashar Moshfeghi
Published	2024-02-15 16:59:41+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
