How Far Are We on the Decision-Making of LLMs? Evaluating LLMs’ Gaming Ability in Multi-Agent Environments

Summary

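Decision-making is a complex process that requires diverse abilities, which makes it an excellent framework for evaluating large language models (LLMs).
Researchers have examined LLMs' decision-making through the lens of game theory.
However, existing evaluations focus mainly on two-player scenarios in which an LLM competes against another player.
In addition, previous benchmarks suffer from test set leakage because of their static design.
The authors introduce GAMA($\gamma$)-Bench, a new framework for evaluating LLMs' Gaming Ability in Multi-Agent environments.
It includes eight classical game theory scenarios and a dynamic scoring scheme specially designed to quantitatively assess LLMs' performance.
$\gamma$-Bench allows flexible game settings and adapts the scoring system to different game parameters, enabling comprehensive evaluation of robustness, generalizability, and strategies for improvement.
The results indicate that GPT-3.5 demonstrates strong robustness but limited generalizability, which can be enhanced using methods like Chain-of-Thought.
The authors also evaluate 13 LLMs from 6 model families, including GPT-3.5, GPT-4, Gemini, LLaMA-3.1, Mixtral, and Qwen-2.
Gemini-1.5-Pro outperforms the other models, scoring $69.8$ out of $100$, followed by LLaMA-3.1-70B ($65.9$) and Mixtral-8x22B ($62.4$).
The code and experimental results are publicly available at https://github.com/CUHK-ARISE/GAMABench.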

Abstract (original)

Decision-making is a complex process requiring diverse abilities, making it an excellent framework for evaluating Large Language Models (LLMs). Researchers have examined LLMs’ decision-making through the lens of Game Theory. However, existing evaluations mainly focus on two-player scenarios where an LLM competes against another. Additionally, previous benchmarks suffer from test set leakage due to their static design. We introduce GAMA($\gamma$)-Bench, a new framework for evaluating LLMs’ Gaming Ability in Multi-Agent environments. It includes eight classical game theory scenarios and a dynamic scoring scheme specially designed to quantitatively assess LLMs’ performance. $\gamma$-Bench allows flexible game settings and adapts the scoring system to different game parameters, enabling comprehensive evaluation of robustness, generalizability, and strategies for improvement. Our results indicate that GPT-3.5 demonstrates strong robustness but limited generalizability, which can be enhanced using methods like Chain-of-Thought. We also evaluate 13 LLMs from 6 model families, including GPT-3.5, GPT-4, Gemini, LLaMA-3.1, Mixtral, and Qwen-2. Gemini-1.5-Pro outperforms others, scoring $69.8$ out of $100$, followed by LLaMA-3.1-70B ($65.9$) and Mixtral-8x22B ($62.4$). Our code and experimental results are publicly available at https://github.com/CUHK-ARISE/GAMABench.

arXiv information

Authors	Jen-tse Huang, Eric John Li, Man Ho Lam, Tian Liang, Wenxuan Wang, Youliang Yuan, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Michael R. Lyu
Published	2025-03-06 18:58:23+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
