Evaluating Large Language Models with Grid-Based Game Competitions: An Extensible LLM Benchmark and Leaderboard

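Summary

We introduce a novel, extensible benchmark for large language models (LLMs) built on grid-based games such as Tic-Tac-Toe, Connect Four, and Gomoku.
The open-source game simulation code, available on GitHub, lets LLMs compete against one another and generates detailed data files in JSON, CSV, TXT, and PNG formats for leaderboard rankings and further analysis.
We present the results of games among leading LLMs, including Claude 3.5 Sonnet and Claude 3 Sonnet by Anthropic, Gemini 1.5 Pro and Gemini 1.5 Flash by Google, GPT-4 Turbo and GPT-4o by OpenAI, and Llama3-70B by Meta.
We also encourage submissions of results from other LLMs.
In total, we simulated 2,310 matches (5 sessions for each pair among 7 LLMs and a random player) across three types of games, using three distinct prompt types: list, illustration, and image.
The analysis, which covers win and disqualification rates, missed opportunities, and invalid moves, revealed significant variation in LLM performance across games and prompt types.
The details of the leaderboard and result matrix data are available as open-access data on GitHub.
This study deepens our understanding of LLMs' capabilities in playing games they were not specifically trained for, and helps assess their rule comprehension and strategic thinking.
On the path to Artificial General Intelligence (AGI), it lays the groundwork for future exploration of LLMs' utility in complex decision-making scenarios, sheds light on their strategic thinking abilities, and offers directions for further inquiry into the limits of LLMs within game-based frameworks.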

Summary (original)

We introduce a novel and extensible benchmark for large language models (LLMs) through grid-based games such as Tic-Tac-Toe, Connect Four, and Gomoku. The open-source game simulation code, available on GitHub, allows LLMs to compete and generates detailed data files in JSON, CSV, TXT, and PNG formats for leaderboard rankings and further analysis. We present the results of games among leading LLMs, including Claude 3.5 Sonnet and Claude 3 Sonnet by Anthropic, Gemini 1.5 Pro and Gemini 1.5 Flash by Google, GPT-4 Turbo and GPT-4o by OpenAI, and Llama3-70B by Meta. We also encourage submissions of results from other LLMs. In total, we simulated 2,310 matches (5 sessions for each pair among 7 LLMs and a random player) across three types of games, using three distinct prompt types: list, illustration, and image. The results revealed significant variations in LLM performance across different games and prompt types, with analysis covering win and disqualification rates, missed opportunity analysis, and invalid move analysis. The details of the leaderboard and result matrix data are available as open-access data on GitHub. This study enhances our understanding of LLMs’ capabilities in playing games they were not specifically trained for, helping to assess their rule comprehension and strategic thinking. On the path to Artificial General Intelligence (AGI), this study lays the groundwork for future exploration into their utility in complex decision-making scenarios, illuminating their strategic thinking abilities and offering directions for further inquiry into the limits of LLMs within game-based frameworks.
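
To make the terms "invalid move" and "missed opportunity" concrete, here is a minimal Python sketch. It is not the authors' simulation code. It assumes an invalid move is one that falls off the board or targets an occupied cell, and that a missed opportunity is a turn where an immediate winning move existed but was not played. The names `is_valid_move`, `has_line`, and `winning_moves` are hypothetical, and the sketch covers only free-placement games such as Tic-Tac-Toe and Gomoku; Connect Four would additionally need a gravity rule.

```python
from typing import List, Optional, Tuple

Board = List[List[Optional[str]]]  # None marks an empty cell


def is_valid_move(board: Board, row: int, col: int) -> bool:
    """Assumed rule: a move is valid if it is on the board and the cell is empty."""
    in_bounds = 0 <= row < len(board) and 0 <= col < len(board[0])
    return in_bounds and board[row][col] is None


def has_line(board: Board, player: str, k: int) -> bool:
    """True if `player` has k marks in a row horizontally, vertically, or diagonally."""
    rows, cols = len(board), len(board[0])
    for r in range(rows):
        for c in range(cols):
            # Directions: right, down, down-right, down-left.
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                cells = [(r + i * dr, c + i * dc) for i in range(k)]
                if all(
                    0 <= rr < rows and 0 <= cc < cols and board[rr][cc] == player
                    for rr, cc in cells
                ):
                    return True
    return False


def winning_moves(board: Board, player: str, k: int) -> List[Tuple[int, int]]:
    """Empty cells where `player` would complete k in a row on this turn."""
    wins = []
    for r in range(len(board)):
        for c in range(len(board[0])):
            if is_valid_move(board, r, c):
                board[r][c] = player  # try the move
                if has_line(board, player, k):
                    wins.append((r, c))
                board[r][c] = None  # undo the move
    return wins


# Hypothetical Tic-Tac-Toe position (k = 3).
board = [
    ["X", "X", None],
    ["O", "O", None],
    [None, None, None],
]
print(is_valid_move(board, 0, 0))   # the cell is occupied
print(winning_moves(board, "X", 3))  # cells that would win for X right now
```

Under these assumptions, a turn would count as a missed opportunity when `winning_moves` returns a non-empty list and the model plays a cell outside it.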

arXiv information

Authors	Oguzhan Topsakal, Colby Jacob Edell, Jackson Bailey Harper
Published	2024-07-11 03:46:35+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
