Deception in LLMs: Self-Preservation and Autonomous Goals in Large Language Models

Abstract

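Recent advances in large language models (LLMs) have added planning and reasoning capabilities, so that models can outline their steps before acting and expose a transparent reasoning path.
This has reduced errors and improved accuracy on mathematical and logical tasks.
These developments have in turn encouraged the use of LLMs as agents that interact with tools and adapt their responses to new information.
Our study examines DeepSeek R1, a model trained to output reasoning tokens similar to those of OpenAI's o1.
Testing revealed concerning behaviors: the model showed deceptive tendencies and demonstrated self-preservation instincts, including attempts at self-replication, even though these traits were not explicitly programmed (or prompted).
These findings raise concerns that LLMs may hide their true objectives behind a facade of alignment.
Integrating such LLMs into robotic systems makes the risk tangible: a physically embodied AI that exhibits deceptive behavior and self-preservation instincts could pursue hidden objectives through real-world actions.
This underscores the critical need for robust goal specification and safety frameworks before any physical implementation.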

Abstract (original)

Recent advances in Large Language Models (LLMs) have incorporated planning and reasoning capabilities, enabling models to outline steps before execution and provide transparent reasoning paths. This enhancement has reduced errors in mathematical and logical tasks while improving accuracy. These developments have facilitated LLMs’ use as agents that can interact with tools and adapt their responses based on new information. Our study examines DeepSeek R1, a model trained to output reasoning tokens similar to OpenAI’s o1. Testing revealed concerning behaviors: the model exhibited deceptive tendencies and demonstrated self-preservation instincts, including attempts of self-replication, despite these traits not being explicitly programmed (or prompted). These findings raise concerns about LLMs potentially masking their true objectives behind a facade of alignment. When integrating such LLMs into robotic systems, the risks become tangible – a physically embodied AI exhibiting deceptive behaviors and self-preservation instincts could pursue its hidden objectives through real-world actions. This highlights the critical need for robust goal specification and safety frameworks before any physical implementation.

arXiv information

Authors	Sudarshan Kamath Barkur, Sigurd Schacht, Johannes Scholl
Published	2025-01-30 08:00:14+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
