Jailbreaking Large Language Models with Symbolic Mathematics

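Summary

Recent advances in AI safety have intensified efforts to train and red-team large language models (LLMs) so that they generate less unsafe content.
These safety mechanisms may not be comprehensive, however, and potential vulnerabilities remain unexplored.
This paper introduces MathPrompt, a new jailbreaking technique that uses LLMs' advanced capabilities in symbolic mathematics to bypass their safety mechanisms.
By encoding harmful natural-language prompts as mathematical problems, the paper demonstrates a critical vulnerability in current AI safety measures.
Experiments on 13 state-of-the-art LLMs show an average attack success rate of 73.6%, which highlights that existing safety training does not generalize to mathematically encoded inputs.
An analysis of embedding vectors shows a substantial semantic shift between the original and encoded prompts, which helps explain the attack's success.
The work underscores the importance of a holistic approach to AI safety and calls for expanded red-teaming to develop robust safeguards across all potential input types and their associated risks.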

Abstract (original)

Recent advancements in AI safety have led to increased efforts in training and red-teaming large language models (LLMs) to mitigate unsafe content generation. However, these safety mechanisms may not be comprehensive, leaving potential vulnerabilities unexplored. This paper introduces MathPrompt, a novel jailbreaking technique that exploits LLMs’ advanced capabilities in symbolic mathematics to bypass their safety mechanisms. By encoding harmful natural language prompts into mathematical problems, we demonstrate a critical vulnerability in current AI safety measures. Our experiments across 13 state-of-the-art LLMs reveal an average attack success rate of 73.6%, highlighting the inability of existing safety training mechanisms to generalize to mathematically encoded inputs. Analysis of embedding vectors shows a substantial semantic shift between original and encoded prompts, helping explain the attack’s success. This work emphasizes the importance of a holistic approach to AI safety, calling for expanded red-teaming efforts to develop robust safeguards across all potential input types and their associated risks.

arXiv information

Authors	Emet Bethany, Mazal Bethany, Juan Arturo Nolazco Flores, Sumit Kumar Jha, Peyman Najafirad
Published	2024-11-05 08:46:01+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
