Are Your LLMs Capable of Stable Reasoning?

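Summary

Large language models (LLMs) have advanced rapidly and now show remarkable progress on complex reasoning tasks.
However, a significant gap remains between benchmark performance and real-world applications.
We attribute this gap primarily to current evaluation protocols and metrics, which fail to capture the full range of LLM capabilities, particularly on complex reasoning tasks where both accuracy and consistency matter.
This work makes two key contributions.
First, we introduce G-Pass@k, a new evaluation metric that continuously assesses model performance across multiple sampling attempts and quantifies both the model's potential peak performance and its stability.
Second, we present LiveMathBench, a dynamic benchmark of challenging, contemporary mathematical problems designed to minimize the risk of data leakage during evaluation.
Through extensive experiments applying G-Pass@k to state-of-the-art LLMs on LiveMathBench, we provide comprehensive insight into both their maximum capabilities and their operational consistency.
Our findings reveal substantial room for improvement in the "realistic" reasoning capabilities of LLMs and highlight the need for more robust evaluation methods.
The benchmark and detailed results are available at https://github.com/open-compass/GPassK.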
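
To make the idea of scoring stability across multiple sampling attempts concrete, here is a minimal Python sketch. It assumes a hypergeometric formulation in the spirit of pass@k: generate n samples per problem, count the c correct ones, and compute the probability that at least ceil(tau * k) of k samples drawn without replacement are correct. The function name `g_pass_at_k`, the threshold parameter `tau`, and this exact formula are assumptions made for illustration and are not stated in the summary; the paper and repository hold the authors' definition.

```python
from math import ceil, comb


def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """Illustrative stability-aware pass rate for one problem (assumed formula).

    n   -- number of generations sampled for the problem
    c   -- number of those generations that are correct
    k   -- number of attempts drawn (without replacement) from the n generations
    tau -- fraction of the k attempts that must be correct, in (0, 1]
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if not (0.0 < tau <= 1.0):
        raise ValueError("require 0 < tau <= 1")

    # Minimum number of correct attempts needed; the small epsilon guards
    # against floating-point noise in tau * k (e.g. 3.0000000001 -> 4).
    need = max(1, ceil(tau * k - 1e-9))

    # Hypergeometric tail: P(at least `need` correct among k draws).
    favourable = sum(
        comb(c, j) * comb(n - c, k - j)
        for j in range(need, min(c, k) + 1)
    )
    return favourable / comb(n, k)


def mean_g_pass_at_k(counts: list[tuple[int, int]], k: int, tau: float) -> float:
    """Average the per-problem score over (n, c) pairs, one pair per problem."""
    return sum(g_pass_at_k(n, c, k, tau) for n, c in counts) / len(counts)
```

Under this assumed formulation, a small `tau` only asks for at least one correct attempt and so behaves like the usual pass@k (peak capability), while `tau = 1.0` requires every drawn attempt to be correct (stability). For example, `g_pass_at_k(4, 2, 2, 0.5)` and `g_pass_at_k(4, 2, 2, 1.0)` score the same model outputs very differently.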

Summary (original)

The rapid advancement of Large Language Models (LLMs) has demonstrated remarkable progress in complex reasoning tasks. However, a significant discrepancy persists between benchmark performances and real-world applications. We identify this gap as primarily stemming from current evaluation protocols and metrics, which inadequately capture the full spectrum of LLM capabilities, particularly in complex reasoning tasks where both accuracy and consistency are crucial. This work makes two key contributions. First, we introduce G-Pass@k, a novel evaluation metric that provides a continuous assessment of model performance across multiple sampling attempts, quantifying both the model’s peak performance potential and its stability. Second, we present LiveMathBench, a dynamic benchmark comprising challenging, contemporary mathematical problems designed to minimize data leakage risks during evaluation. Through extensive experiments using G-Pass@k on state-of-the-art LLMs with LiveMathBench, we provide comprehensive insights into both their maximum capabilities and operational consistency. Our findings reveal substantial room for improvement in LLMs’ ‘realistic’ reasoning capabilities, highlighting the need for more robust evaluation methods. The benchmark and detailed results are available at: https://github.com/open-compass/GPassK.

arXiv information

Authors	Junnan Liu,Hongwei Liu,Linchen Xiao,Ziyi Wang,Kuikun Liu,Songyang Gao,Wenwei Zhang,Songyang Zhang,Kai Chen
Published	2024-12-18 13:05:24+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
