Benchmarking Large Language Models for Math Reasoning Tasks

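Summary

The use of large language models (LLMs) for mathematical reasoning has become a cornerstone of related research: it demonstrates the intelligence of these models and, through their strong performance, opens up potential practical applications such as in education.
Although many datasets and in-context learning algorithms have been designed to improve the ability of LLMs to automate mathematical problem solving, the lack of a comprehensive benchmark across datasets makes it difficult to select an appropriate model for a specific task.
In this project, we present a benchmark that fairly compares seven state-of-the-art in-context learning algorithms for mathematical problem solving across five widely used mathematical datasets on four powerful foundation models.
We also explore the trade-off between efficiency and performance, highlighting practical applications of LLMs for mathematical reasoning.
Our results indicate that larger foundation models such as GPT-4o and LLaMA 3-70B can solve mathematical reasoning tasks independently of the concrete prompting strategy, whereas for smaller models the in-context learning approach significantly influences performance.
Moreover, the optimal prompt depends on the chosen foundation model.
We open-source our benchmark code to support the integration of additional models in future research.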

Summary (original)

The use of Large Language Models (LLMs) in mathematical reasoning has become a cornerstone of related research, demonstrating the intelligence of these models and enabling potential practical applications through their advanced performance, such as in educational settings. Despite the variety of datasets and in-context learning algorithms designed to improve the ability of LLMs to automate mathematical problem solving, the lack of comprehensive benchmarking across different datasets makes it complicated to select an appropriate model for specific tasks. In this project, we present a benchmark that fairly compares seven state-of-the-art in-context learning algorithms for mathematical problem solving across five widely used mathematical datasets on four powerful foundation models. Furthermore, we explore the trade-off between efficiency and performance, highlighting the practical applications of LLMs for mathematical reasoning. Our results indicate that larger foundation models like GPT-4o and LLaMA 3-70B can solve mathematical reasoning independently from the concrete prompting strategy, while for smaller models the in-context learning approach significantly influences the performance. Moreover, the optimal prompt depends on the chosen foundation model. We open-source our benchmark code to support the integration of additional models in future research.
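
The comparison described in the abstract spans 7 in-context learning algorithms, 5 datasets, and 4 foundation models, i.e. 7 × 5 × 4 = 140 configurations. The following is a minimal sketch of how such a grid could be enumerated and scored, not the authors' benchmark code. The abstract gives only the counts, so every name here (`STRATEGIES`, `DATASETS`, `MODELS`, `solve`, `load_examples`, `run_grid`) is a hypothetical placeholder, and exact-match accuracy is an assumed metric.

```python
from itertools import product

# Placeholder identifiers: the abstract states only how many strategies,
# datasets, and models are compared, not which ones.
STRATEGIES = [f"strategy_{i}" for i in range(1, 8)]  # 7 in-context learning algorithms
DATASETS = [f"dataset_{i}" for i in range(1, 6)]     # 5 math datasets
MODELS = [f"model_{i}" for i in range(1, 5)]         # 4 foundation models


def accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference (assumed metric)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


def run_grid(solve, load_examples):
    """Score every (model, dataset, strategy) combination.

    solve(model, strategy, question) -> answer string   (hypothetical callable)
    load_examples(dataset) -> list of (question, answer) (hypothetical callable)
    """
    results = {}
    for model, dataset, strategy in product(MODELS, DATASETS, STRATEGIES):
        examples = load_examples(dataset)
        predictions = [solve(model, strategy, question) for question, _ in examples]
        references = [answer for _, answer in examples]
        results[(model, dataset, strategy)] = accuracy(predictions, references)
    return results
```

Because every configuration is scored on the same examples with the same metric, differences in the resulting table can be attributed to the model or the prompting strategy rather than to the evaluation setup.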

arXiv information

Authors	Kathrin Seßler,Yao Rong,Emek Gözlüklü,Enkelejda Kasneci
Published	2024-12-19 15:25:41+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
