PokerBench: Training Large Language Models to become Professional Poker Players

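Summary

We introduce PokerBench, a benchmark for evaluating the poker-playing abilities of large language models (LLMs).
LLMs excel at traditional NLP tasks, but applying them to complex, strategic games such as poker poses a new challenge.
Poker is an incomplete-information game that demands many skills: mathematics, reasoning, planning, strategy, and a deep understanding of game theory and human psychology.
This makes poker an ideal next frontier for large language models.
PokerBench consists of a comprehensive compilation of the 11,000 most important scenarios, split between pre-flop and post-flop play and developed in collaboration with trained poker players.
We evaluate prominent models, including GPT-4, ChatGPT 3.5, and various Llama and Gemma series models, and find that all state-of-the-art LLMs fall short of optimal poker play.
After fine-tuning, however, these models show marked improvements.
We validate PokerBench by having models with different scores compete against each other, demonstrating that higher PokerBench scores lead to higher win rates in actual poker games.
Through gameplay between our fine-tuned model and GPT-4, we also identify limitations of simple supervised fine-tuning for learning an optimal playing strategy, which suggests that more advanced methodologies are needed to train language models to excel at games.
PokerBench thus serves both as a benchmark for quick, reliable evaluation of the poker-playing ability of LLMs and as a comprehensive benchmark for studying the progress of LLMs in complex game-playing scenarios.
The dataset and code will be made available at https://github.com/pokerllm/pokerbench.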

Abstract (original)

We introduce PokerBench – a benchmark for evaluating the poker-playing abilities of large language models (LLMs). As LLMs excel in traditional NLP tasks, their application to complex, strategic games like poker poses a new challenge. Poker, an incomplete information game, demands a multitude of skills such as mathematics, reasoning, planning, strategy, and a deep understanding of game theory and human psychology. This makes Poker the ideal next frontier for large language models. PokerBench consists of a comprehensive compilation of 11,000 most important scenarios, split between pre-flop and post-flop play, developed in collaboration with trained poker players. We evaluate prominent models including GPT-4, ChatGPT 3.5, and various Llama and Gemma series models, finding that all state-of-the-art LLMs underperform in playing optimal poker. However, after fine-tuning, these models show marked improvements. We validate PokerBench by having models with different scores compete with each other, demonstrating that higher scores on PokerBench lead to higher win rates in actual poker games. Through gameplay between our fine-tuned model and GPT-4, we also identify limitations of simple supervised fine-tuning for learning optimal playing strategy, suggesting the need for more advanced methodologies for effectively training language models to excel in games. PokerBench thus presents a unique benchmark for a quick and reliable evaluation of the poker-playing ability of LLMs as well as a comprehensive benchmark to study the progress of LLMs in complex game-playing scenarios. The dataset and code will be made available at: \url{https://github.com/pokerllm/pokerbench}.

arXiv information

Authors	Richard Zhuang, Akshat Gupta, Richard Yang, Aniket Rahane, Zhengyu Li, Gopala Anumanchipalli
Published	2025-01-14 18:59:03+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
