AvalonBench: Evaluating LLMs Playing the Game of Avalon

Summary

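In this paper, we explore the potential of large language model (LLM) agents in playing Resistance Avalon, a strategic social deduction game.
Players in Avalon must not only make informed decisions based on dynamically evolving game phases, but also take part in discussions in which they deceive, deduce, and negotiate with other players.
These characteristics make Avalon a compelling testbed for studying the decision-making and language-processing capabilities of LLM agents.
To facilitate research in this area, we introduce AvalonBench, a comprehensive game environment tailored for evaluating multi-agent LLM agents.
The benchmark incorporates (1) a game environment for Avalon, (2) rule-based bots as baseline opponents, and (3) ReAct-style LLM agents with prompts tailored to each role.
Notably, our evaluations based on AvalonBench highlight a clear capability gap.
For example, models such as ChatGPT playing the good role achieved a win rate of 22.2% against rule-based bots playing the evil role, whereas a good-role bot achieved a win rate of 38.2% in the same setting.
We envision AvalonBench as a good testbed for developing more advanced LLMs (with self-play) and agent frameworks that can effectively model the layered complexities of such game environments.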
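
The capability gap quoted above is simple arithmetic on win rates, which the following minimal Python sketch illustrates. It is not code from AvalonBench: the helper names `win_rate` and `gap_in_points` are hypothetical. Only the two percentages come from the abstract.

```python
def win_rate(outcomes):
    """Return the fraction of games won, given a list of booleans (True = win)."""
    if not outcomes:
        raise ValueError("need at least one game outcome")
    return sum(outcomes) / len(outcomes)


def gap_in_points(lower, higher):
    """Return the difference between two win rates, in percentage points."""
    return round((higher - lower) * 100, 1)


# Win rates reported in the abstract: good side versus rule-based evil bots.
LLM_GOOD_WIN_RATE = 0.222  # ChatGPT-like model playing the good role
BOT_GOOD_WIN_RATE = 0.382  # rule-based bot playing the good role

gap = gap_in_points(LLM_GOOD_WIN_RATE, BOT_GOOD_WIN_RATE)
print(f"Rule-based good bot leads by {gap} percentage points")
```

Here `win_rate` shows how such a figure would be derived from raw game outcomes. The abstract does not state the number of games played, so the sketch starts from the reported rates.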

Summary (original)

In this paper, we explore the potential of Large Language Models (LLMs) Agents in playing the strategic social deduction game, Resistance Avalon. Players in Avalon are challenged not only to make informed decisions based on dynamically evolving game phases, but also to engage in discussions where they must deceive, deduce, and negotiate with other players. These characteristics make Avalon a compelling test-bed to study the decision-making and language-processing capabilities of LLM Agents. To facilitate research in this line, we introduce AvalonBench – a comprehensive game environment tailored for evaluating multi-agent LLM Agents. This benchmark incorporates: (1) a game environment for Avalon, (2) rule-based bots as baseline opponents, and (3) ReAct-style LLM agents with tailored prompts for each role. Notably, our evaluations based on AvalonBench highlight a clear capability gap. For instance, models like ChatGPT playing good-role got a win rate of 22.2% against rule-based bots playing evil, while good-role bot achieves 38.2% win rate in the same setting. We envision AvalonBench could be a good test-bed for developing more advanced LLMs (with self-playing) and agent frameworks that can effectively model the layered complexities of such game environments.

arXiv information

Authors	Jonathan Light, Min Cai, Sheng Shen, Ziniu Hu
Published	2023-11-08 16:01:32+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
