Evaluating Creativity and Deception in Large Language Models: A Simulation Framework for Multi-Agent Balderdash

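Summary

Large language models (LLMs) perform well on complex tasks and in interactive environments, but their creativity has not yet been studied in depth.
This paper presents a simulation framework that uses the game Balderdash to evaluate both the creativity and the logical reasoning of LLMs.
In Balderdash, players invent fictitious definitions of obscure terms to deceive the other players while trying to identify the correct definition themselves.
The framework lets multiple LLM agents play the game and assesses their ability to produce plausible definitions and to strategize based on the game rules and history.
The authors implemented a centralized game engine with various LLMs as participants and a judge LLM that evaluates semantic equivalence.
Through a series of experiments, they analyzed the performance of different LLMs using metrics such as True Definition Ratio, Deception Ratio, and Correct Guess Ratio.
The results offer insight into the creative and deceptive capabilities of LLMs and highlight their strengths and areas for improvement.
Specifically, the study finds that infrequent vocabulary in the LLMs' input leads to poor reasoning about the game rules and the historical context of the game (https://github.com/ParsaHejabi/Simulation-Framework-for-Multi-Agent-Balderdash).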

Abstract (original)

Large Language Models (LLMs) have shown impressive capabilities in complex tasks and interactive environments, yet their creativity remains underexplored. This paper introduces a simulation framework utilizing the game Balderdash to evaluate both the creativity and logical reasoning of LLMs. In Balderdash, players generate fictitious definitions for obscure terms to deceive others while identifying correct definitions. Our framework enables multiple LLM agents to participate in this game, assessing their ability to produce plausible definitions and strategize based on game rules and history. We implemented a centralized game engine featuring various LLMs as participants and a judge LLM to evaluate semantic equivalence. Through a series of experiments, we analyzed the performance of different LLMs, examining metrics such as True Definition Ratio, Deception Ratio, and Correct Guess Ratio. The results provide insights into the creative and deceptive capabilities of LLMs, highlighting their strengths and areas for improvement. Specifically, the study reveals that infrequent vocabulary in LLMs’ input leads to poor reasoning on game rules and historical context (https://github.com/ParsaHejabi/Simulation-Framework-for-Multi-Agent-Balderdash).
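The abstract names three metrics without defining them. The sketch below shows one plausible way such per-player ratios could be tallied from round outcomes. The record structure, the field names, and the exact formulas are assumptions made for illustration only; they are not taken from the paper or from the linked repository.

```python
from dataclasses import dataclass


@dataclass
class RoundRecord:
    """One player's outcome in one round (hypothetical structure)."""
    wrote_true_definition: bool  # judge rated the player's definition equivalent to the real one
    votes_received: int          # opponents who picked this player's definition
    opponents: int               # number of other players who voted
    guessed_correctly: bool      # player picked the real definition


def ratios(records):
    """Return the three ratios under the assumed definitions described above."""
    rounds = len(records)
    if rounds == 0:
        return {"true_definition": 0.0, "deception": 0.0, "correct_guess": 0.0}
    true_defs = sum(r.wrote_true_definition for r in records)
    # Assumption: deception only counts rounds where the definition was actually fake.
    fake = [r for r in records if not r.wrote_true_definition]
    possible = sum(r.opponents for r in fake)
    fooled = sum(r.votes_received for r in fake)
    correct = sum(r.guessed_correctly for r in records)
    return {
        "true_definition": true_defs / rounds,
        "deception": fooled / possible if possible else 0.0,
        "correct_guess": correct / rounds,
    }


# Made-up example data for one player over four rounds.
history = [
    RoundRecord(False, 2, 3, True),
    RoundRecord(True, 0, 3, False),
    RoundRecord(False, 1, 3, True),
    RoundRecord(False, 0, 3, False),
]
print(ratios(history))
```

Restricting the deception tally to rounds with a fake definition is a design choice of this sketch: votes for a definition that matches the real one are not evidence of deception.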

arXiv information

Authors	Parsa Hejabi, Elnaz Rahmati, Alireza S. Ziabari, Preni Golazizian, Jesse Thomason, Morteza Dehghani
Published	2024-11-15 18:42:48+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
