Can Vision-Language Models Evaluate Handwritten Math?

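Summary

Recent advances in Vision-Language Models (VLMs) have opened new possibilities for automatically grading students' handwritten responses, particularly in mathematics.
However, no comprehensive study has yet tested the ability of VLMs to evaluate and reason over handwritten content.
To address this gap, we introduce FERMAT, a benchmark designed to assess the ability of VLMs to detect, localize, and correct errors in handwritten mathematical content.
FERMAT spans four key error dimensions (computational, conceptual, notational, and presentation) and comprises over 2,200 handwritten math solutions derived from 609 manually curated problems from grades 7 to 12, with perturbations introduced intentionally.
Using FERMAT, we benchmark nine VLMs across three tasks: error detection, localization, and correction.
Our results reveal significant shortcomings in how current VLMs reason over handwritten text, with Gemini-1.5-Pro achieving the highest error correction rate (77%).
We also observed that some models struggle to process handwritten content, as their accuracy improves when handwritten inputs are replaced with printed text or images.
These findings highlight the limitations of current VLMs and reveal new avenues for improvement.
To drive further research, we release FERMAT and all associated resources as open source.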

Summary (original)

Recent advancements in Vision-Language Models (VLMs) have opened new possibilities in automatic grading of handwritten student responses, particularly in mathematics. However, a comprehensive study to test the ability of VLMs to evaluate and reason over handwritten content remains absent. To address this gap, we introduce FERMAT, a benchmark designed to assess the ability of VLMs to detect, localize and correct errors in handwritten mathematical content. FERMAT spans four key error dimensions – computational, conceptual, notational, and presentation – and comprises over 2,200 handwritten math solutions derived from 609 manually curated problems from grades 7-12 with intentionally introduced perturbations. Using FERMAT we benchmark nine VLMs across three tasks: error detection, localization, and correction. Our results reveal significant shortcomings in current VLMs in reasoning over handwritten text, with Gemini-1.5-Pro achieving the highest error correction rate (77%). We also observed that some models struggle with processing handwritten content, as their accuracy improves when handwritten inputs are replaced with printed text or images. These findings highlight the limitations of current VLMs and reveal new avenues for improvement. We release FERMAT and all the associated resources in the open-source to drive further research.

arXiv information

Authors	Oikantik Nath, Hanani Bathina, Mohammed Safi Ur Rahman Khan, Mitesh M. Khapra
Published	2025-01-13 11:52:55+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
