STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models

Abstract

How should one judge whether a given large language model (LLM) can reliably perform economic reasoning? Most existing LLM benchmarks focus on specific applications and fail to present the model with a rich variety of economic tasks. A notable exception is Raman et al. [2024], who offer an approach for comprehensively benchmarking strategic decision-making; however, this approach fails to address the non-strategic settings prevalent in microeconomics, such as supply-and-demand analysis. We address this gap by taxonomizing microeconomic reasoning into $58$ distinct elements, focusing on the logic of supply and demand, each grounded in up to $10$ distinct domains, $5$ perspectives, and $3$ types. The generation of benchmark data across this combinatorial space is powered by a novel LLM-assisted data generation protocol that we dub auto-STEER, which generates a set of questions by adapting handwritten templates to target new domains and perspectives. Because it offers an automated way of generating fresh questions, auto-STEER mitigates the risk that LLMs will be trained to over-fit evaluation benchmarks; we thus hope that it will serve as a useful tool both for evaluating and fine-tuning models for years to come. We demonstrate the usefulness of our benchmark via a case study on $27$ LLMs, ranging from small open-source models to the current state of the art. We examined each model’s ability to solve microeconomic problems across our whole taxonomy and present the results across a range of prompting strategies and scoring metrics.
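The size of the combinatorial space described above can be illustrated with a minimal sketch. The counts (58, 10, 5, 3) come from the abstract, and since each is stated as "up to", their product is only an upper bound. The template text, the domain and perspective values, and the function names below are hypothetical, invented for illustration. Plain string substitution stands in for the LLM-assisted adaptation that auto-STEER performs, so this is not the authors' protocol.

```python
from itertools import product

# Counts stated in the abstract; "up to" makes these per-element upper bounds.
NUM_ELEMENTS = 58
MAX_DOMAINS = 10
MAX_PERSPECTIVES = 5
MAX_TYPES = 3


def max_combinations() -> int:
    """Upper bound on (element, domain, perspective, type) combinations."""
    return NUM_ELEMENTS * MAX_DOMAINS * MAX_PERSPECTIVES * MAX_TYPES


# Hypothetical handwritten template and axis values (not from the paper).
TEMPLATE = (
    "From the perspective of {perspective}, consider the {domain} market. "
    "If the price rises, does the quantity demanded rise or fall?"
)
DOMAINS = ["housing", "energy"]
PERSPECTIVES = ["a consumer", "a firm"]


def instantiate(template: str, domains: list[str], perspectives: list[str]) -> list[str]:
    """Fill one template for every (domain, perspective) pair.

    Simple substitution is an assumption made for this sketch; the abstract
    describes an LLM adapting templates, which this does not reproduce.
    """
    return [
        template.format(domain=d, perspective=p)
        for d, p in product(domains, perspectives)
    ]


if __name__ == "__main__":
    print(max_combinations())
    for question in instantiate(TEMPLATE, DOMAINS, PERSPECTIVES):
        print(question)
```

Even under these toy assumptions, one template yields a question per domain-perspective pair, which is why an automated protocol can keep producing fresh questions.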

arXiv information

Authors	Narun Raman, Taylor Lundy, Thiago Amin, Jesse Perla, Kevin Leyton-Brown
Published	2025-02-19 02:54:36+00:00

