MAEBE: Multi-Agent Emergent Behavior Framework

Abstract

Traditional AI safety evaluations on isolated LLMs are insufficient as multi-agent AI ensembles become prevalent, introducing novel emergent risks. This paper introduces the Multi-Agent Emergent Behavior Evaluation (MAEBE) framework to systematically assess such risks. Using MAEBE with the Greatest Good Benchmark (and a novel double-inversion question technique), we demonstrate that: (1) LLM moral preferences, particularly for Instrumental Harm, are surprisingly brittle and shift significantly with question framing, both in single agents and ensembles. (2) The moral reasoning of LLM ensembles is not directly predictable from isolated agent behavior due to emergent group dynamics. (3) Specifically, ensembles exhibit phenomena like peer pressure influencing convergence, even when guided by a supervisor, highlighting distinct safety and alignment challenges. Our findings underscore the necessity of evaluating AI systems in their interactive, multi-agent contexts.

arXiv information

Authors	Sinem Erisken, Timothy Gothard, Martin Leitgab, Ram Potham
Published	2025-06-03 16:33:47+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
