GraphOmni: A Comprehensive and Extendable Benchmark Framework for Large Language Models on Graph-theoretic Tasks

Abstract

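This paper introduces GraphOmni, a comprehensive benchmark designed to evaluate the reasoning capabilities of LLMs on graph-theoretic tasks expressed in natural language.
GraphOmni covers diverse graph types, serialization formats, and prompting schemes, substantially exceeding prior efforts in both scope and depth.
Through extensive, systematic evaluation, we identify critical interactions among these dimensions and show that they have a substantial impact on model performance.
Our experiments show that state-of-the-art models such as Claude-3.5 and o4-mini consistently outperform other models, yet even these leading models leave considerable room for improvement.
Performance varies with the specific combination of factors considered, which underscores the need for comprehensive evaluation across these interconnected dimensions.
We also observe that serialization and prompting strategies affect open-source and closed-source models differently, which motivates the development of tailored approaches.
Motivated by these findings, we further propose a reinforcement learning-inspired framework that adaptively selects the optimal factors influencing LLM reasoning capabilities.
This flexible and extendable benchmark not only deepens the understanding of LLM performance on structured tasks but also provides a robust foundation for advancing research on LLM-based graph reasoning.
The code and datasets are available at https://github.com/GAI-Community/GraphOmni.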

Abstract (original)

This paper introduces GraphOmni, a comprehensive benchmark designed to evaluate the reasoning capabilities of LLMs on graph-theoretic tasks articulated in natural language. GraphOmni encompasses diverse graph types, serialization formats, and prompting schemes, significantly exceeding prior efforts in both scope and depth. Through extensive systematic evaluation, we identify critical interactions among these dimensions, demonstrating their substantial impact on model performance. Our experiments reveal that state-of-the-art models like Claude-3.5 and o4-mini consistently outperform other models, yet even these leading models exhibit substantial room for improvement. Performance variability is evident depending on the specific combinations of factors we considered, underscoring the necessity of comprehensive evaluations across these interconnected dimensions. Additionally, we observe distinct impacts of serialization and prompting strategies between open-source and closed-source models, encouraging the development of tailored approaches. Motivated by the findings, we also propose a reinforcement learning-inspired framework that adaptively selects the optimal factors influencing LLM reasoning capabilities. This flexible and extendable benchmark not only deepens our understanding of LLM performance on structured tasks but also provides a robust foundation for advancing research in LLM-based graph reasoning. The code and datasets are available at https://github.com/GAI-Community/GraphOmni.

arXiv information

Authors	Hao Xu,Xiangru Jian,Xinjian Zhao,Wei Pang,Chao Zhang,Suyuchen Wang,Qixin Zhang,Zhengyuan Dong,Joao Monteiro,Bang Liu,Qiuzhuang Sun,Tianshu Yu
Published	2025-05-28 17:47:46+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
