OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation

要約

大規模な言語モデル（LLM）の一般的な能力が改善され、エージェントアプリケーションがより広くなると、根底にある欺ceptionは、体系的な評価と効果的な監視を緊急に必要とします。
シミュレートされたゲームを使用したり、限られた選択肢を提示したりする既存の評価とは異なり、オープンエンドシナリオデータセットを備えた新しいDeception評価フレームワークであるOpendeceptionを紹介します。
Opendeceptionは、内部推論プロセスを検査することにより、LLMベースのエージェントの欺ception意図と能力の両方を共同で評価します。
具体的には、LLMがユーザーと集中的に相互作用する5種類の一般的なユースケースを構築します。それぞれが、現実世界の10の多様な具体的なシナリオで構成されています。
人間のテスターとの高リスクの誤った相互作用の倫理的懸念とコストを回避するために、エージェントシミュレーションを介してマルチターンダイアログをシミュレートすることを提案します。
Opendeceptionでの11の主流LLMの広範な評価は、LLMベースのエージェントの欺ceptionリスクとセキュリティの懸念に対処する緊急の必要性を強調しています。モデル全体の欺ception意図比は80％を超え、欺ceptionの成功率は50％を超えます。
さらに、より強力な能力を持つLLMは、欺ceptionのリスクが高いことを示しており、欺ception的な行動を阻害するより多くの整合的努力が必要です。

要約(オリジナル)

As the general capabilities of large language models (LLMs) improve and agent applications become more widespread, the underlying deception risks urgently require systematic evaluation and effective oversight. Unlike existing evaluation which uses simulated games or presents limited choices, we introduce OpenDeception, a novel deception evaluation framework with an open-ended scenario dataset. OpenDeception jointly evaluates both the deception intention and capabilities of LLM-based agents by inspecting their internal reasoning process. Specifically, we construct five types of common use cases where LLMs intensively interact with the user, each consisting of ten diverse, concrete scenarios from the real world. To avoid ethical concerns and costs of high-risk deceptive interactions with human testers, we propose to simulate the multi-turn dialogue via agent simulation. Extensive evaluation of eleven mainstream LLMs on OpenDeception highlights the urgent need to address deception risks and security concerns in LLM-based agents: the deception intention ratio across the models exceeds 80%, while the deception success rate surpasses 50%. Furthermore, we observe that LLMs with stronger capabilities do exhibit a higher risk of deception, which calls for more alignment efforts on inhibiting deceptive behaviors.

arxiv情報

著者	Yichen Wu,Xudong Pan,Geng Hong,Min Yang
発行日	2025-04-18 14:11:27+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

最近の投稿

最近のコメント

アーカイブ

カテゴリー