VisionGraph: Leveraging Large Multimodal Models for Graph Theory Problems in Visual Context

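Summary

Large multimodal models (LMMs) have achieved impressive success in visual understanding and reasoning, markedly improving mathematical reasoning in visual contexts.
A challenging type of visual math, however, is the multimodal graph theory problem, which requires LMMs to understand graphical structures accurately and to perform multi-step reasoning on the visual graph.
Exploring multimodal graph theory problems will also lead to more effective strategies in fields such as biology, transportation, and robotics planning.
To move in this direction, the authors design VisionGraph, which they describe as the first benchmark for probing how well advanced LMMs solve multimodal graph theory problems.
It covers eight complex graph problem tasks, ranging from connectivity to shortest path problems.
They then present a Description-Program-Reasoning (DPR) chain, which improves the logical accuracy of the reasoning process through graphical structure description generation and algorithm-aware multi-step reasoning.
Their extensive study shows that 1) GPT-4V outperforms Gemini Pro in multi-step graph reasoning.
2) All LMMs perceive graphical structures with poor accuracy, whether in zero/few-shot settings or after supervised fine-tuning (SFT), and this in turn degrades problem-solving performance.
3) DPR significantly improves the multi-step graph reasoning capabilities of LMMs, and the GPT-4V (DPR) agent achieves SOTA performance.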

Abstract (original)

Large Multimodal Models (LMMs) have achieved impressive success in visual understanding and reasoning, remarkably improving the performance of mathematical reasoning in a visual context. Yet, a challenging type of visual math lies in the multimodal graph theory problem, which demands that LMMs understand the graphical structures accurately and perform multi-step reasoning on the visual graph. Additionally, exploring multimodal graph theory problems will lead to more effective strategies in fields like biology, transportation, and robotics planning. To step forward in this direction, we are the first to design a benchmark named VisionGraph, used to explore the capabilities of advanced LMMs in solving multimodal graph theory problems. It encompasses eight complex graph problem tasks, from connectivity to shortest path problems. Subsequently, we present a Description-Program-Reasoning (DPR) chain to enhance the logical accuracy of reasoning processes through graphical structure description generation and algorithm-aware multi-step reasoning. Our extensive study shows that 1) GPT-4V outperforms Gemini Pro in multi-step graph reasoning; 2) All LMMs exhibit inferior perception accuracy for graphical structures, whether in zero/few-shot settings or with supervised fine-tuning (SFT), which further affects problem-solving performance; 3) DPR significantly improves the multi-step graph reasoning capabilities of LMMs and the GPT-4V (DPR) agent achieves SOTA performance.
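
The abstract does not spell out the DPR prompts or implementation, so the following is only a minimal Python sketch of the general idea of describing a graph and then solving the task with an explicit algorithm. It assumes the description stage has already produced a plain-text edge list; that format (one `u v weight` edge per line) and the function names `parse_description`, `is_connected`, and `shortest_path_length` are hypothetical and not taken from the paper. The two tasks shown, connectivity and shortest path, are the endpoints of the task range named in the abstract, solved here with standard breadth-first search and Dijkstra's algorithm.

```python
import heapq
from collections import deque


def parse_description(description):
    """Parse a hypothetical textual graph description (one 'u v weight' edge per line)."""
    graph = {}
    for line in description.strip().splitlines():
        u, v, w = line.split()
        weight = float(w)
        # Treat every edge as undirected (an assumption of this sketch).
        graph.setdefault(u, {})[v] = weight
        graph.setdefault(v, {})[u] = weight
    return graph


def is_connected(graph, source, target):
    """Connectivity task: breadth-first search from source until target is found."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbor in graph.get(node, {}):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def shortest_path_length(graph, source, target):
    """Shortest-path task: Dijkstra's algorithm; returns None if target is unreachable."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return None


# Hypothetical description of a small graph with two components.
description = """
A B 1
B C 2
A C 5
D E 1
"""
graph = parse_description(description)
print(is_connected(graph, "A", "C"))
print(is_connected(graph, "A", "D"))
print(shortest_path_length(graph, "A", "C"))
```

In a sketch like this, any error in the parsed edge list propagates directly into the computed answer, which is consistent with the abstract's observation that perception accuracy for graphical structures affects problem-solving performance.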

arXiv information

Authors	Yunxin Li, Baotian Hu, Haoyuan Shi, Wei Wang, Longyue Wang, Min Zhang
Published	2024-05-08 10:42:48+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
