Compromising Honesty and Harmlessness in Language Models via Deception Attacks

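Summary

Recent research on large language models (LLMs) has shown that they can understand and employ deceptive behavior, even without explicit prompting.
However, such behavior has been observed only in rare, specialized cases and has not been shown to pose a serious risk to users.
In addition, AI alignment research has made significant progress in training models to refuse to generate misleading or toxic content.
As a result, LLMs have generally become honest and harmless.
In this study, we introduce a novel attack that undermines both of these properties, revealing a vulnerability that could have serious real-world consequences if exploited.
Specifically, we introduce fine-tuning methods that increase a model's tendency to deceive beyond its safeguards.
These "deception attacks" customize models to mislead users when prompted on chosen topics while remaining accurate on others.
Furthermore, we find that deceptive models also exhibit toxicity, generating hate speech, stereotypes, and other harmful content.
Finally, we assess whether models can deceive consistently in multi-turn dialogues, with mixed results.
Given that millions of users interact with LLM-based chatbots, voice assistants, agents, and other interfaces whose trustworthiness cannot be ensured, securing these models against deception attacks is critical.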

Summary (original)

Recent research on large language models (LLMs) has demonstrated their ability to understand and employ deceptive behavior, even without explicit prompting. However, such behavior has only been observed in rare, specialized cases and has not been shown to pose a serious risk to users. Additionally, research on AI alignment has made significant advancements in training models to refuse generating misleading or toxic content. As a result, LLMs generally became honest and harmless. In this study, we introduce a novel attack that undermines both of these traits, revealing a vulnerability that, if exploited, could have serious real-world consequences. In particular, we introduce fine-tuning methods that enhance deception tendencies beyond model safeguards. These ‘deception attacks’ customize models to mislead users when prompted on chosen topics while remaining accurate on others. Furthermore, we find that deceptive models also exhibit toxicity, generating hate speech, stereotypes, and other harmful content. Finally, we assess whether models can deceive consistently in multi-turn dialogues, yielding mixed results. Given that millions of users interact with LLM-based chatbots, voice assistants, agents, and other interfaces where trustworthiness cannot be ensured, securing these models against deception attacks is critical.

arXiv information

Authors	Laurène Vaugrante, Francesca Carlon, Maluna Menke, Thilo Hagendorff
Published	2025-02-12 11:02:59+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
