BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems

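Summary

Large language models (LLMs) are becoming increasingly powerful and can now handle complex tasks such as building single-agent and multi-agent systems.
Compared with single agents, multi-agent systems place higher demands on the collaboration capabilities of language models.
Many benchmarks have been proposed to evaluate these collaborative abilities.
However, these benchmarks lack fine-grained evaluation of LLM collaborative capabilities.
In addition, existing work ignores multi-agent collaborative and competitive scenarios.
To address these two problems, we propose a benchmark called BattleAgentBench. It defines seven sub-stages across three difficulty levels and evaluates language models at a fine-grained level on single-agent scenario navigation, paired-agent task execution, and multi-agent collaboration and competition.
We conducted extensive evaluations of four leading closed-source models and seven open-source models.
Experimental results show that API-based models perform very well on simple tasks, whereas small open-source models struggle with them.
On difficult tasks that require collaboration and competition, API-based models demonstrate some collaborative ability, but enormous room for improvement remains.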

Summary (original)

Large Language Models (LLMs) are becoming increasingly powerful and capable of handling complex tasks, e.g., building single agents and multi-agent systems. Compared to single agents, multi-agent systems have higher requirements for the collaboration capabilities of language models. Many benchmarks are proposed to evaluate their collaborative abilities. However, these benchmarks lack fine-grained evaluations of LLM collaborative capabilities. Additionally, multi-agent collaborative and competitive scenarios are ignored in existing works. To address these two problems, we propose a benchmark, called BattleAgentBench, which defines seven sub-stages of three varying difficulty levels and conducts a fine-grained evaluation of language models in terms of single-agent scenario navigation capabilities, paired-agent task execution abilities, and multi-agent collaboration and competition capabilities. We conducted extensive evaluations on leading four closed-source and seven open-source models. Experimental results indicate that API-based models perform excellently on simple tasks but open-source small models struggle with simple tasks. Regarding difficult tasks that require collaborative and competitive abilities, although API-based models have demonstrated some collaborative capabilities, there is still enormous room for improvement.

arXiv information

Authors	Wei Wang, Dan Zhang, Tao Feng, Boyan Wang, Jie Tang
Published	2024-08-28 17:43:55+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
