Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

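Summary

Large language models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation, enabling applications across fields that extend beyond healthcare, software engineering, and conversational systems.
Despite these advances over the past few years, LLMs have shown considerable vulnerabilities, particularly to prompt injection and jailbreaking attacks.
This review analyzes the state of research on these vulnerabilities and presents the available defense strategies.
Attack approaches are grouped into prompt-based, model-based, multimodal, and multilingual categories, covering techniques such as adversarial prompting, backdoor injection, and cross-modality exploits.
The review also surveys a range of defense mechanisms, including prompt filtering, transformation, alignment techniques, multi-agent defenses, and self-regulation, and evaluates their strengths and shortcomings.
It further discusses the key metrics and benchmarks used to assess LLM safety and robustness, noting challenges such as quantifying attack success in interactive contexts and biases in existing datasets.
After identifying gaps in current research, it suggests future directions: resilient alignment strategies, advanced defenses against evolving attacks, automated jailbreak detection, and consideration of ethical and societal impacts.
The review emphasizes the need for continued research and cooperation within the AI community to strengthen LLM security and ensure safe deployment.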

Abstract (original)

Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation, enabling applications across fields beyond healthcare, software engineering, and conversational systems. Despite these advancements in the past few years, LLMs have shown considerable vulnerabilities, particularly to prompt injection and jailbreaking attacks. This review analyzes the state of research on these vulnerabilities and presents available defense strategies. We roughly categorize attack approaches into prompt-based, model-based, multimodal, and multilingual, covering techniques such as adversarial prompting, backdoor injections, and cross-modality exploits. We also review various defense mechanisms, including prompt filtering, transformation, alignment techniques, multi-agent defenses, and self-regulation, evaluating their strengths and shortcomings. We also discuss key metrics and benchmarks used to assess LLM safety and robustness, noting challenges like the quantification of attack success in interactive contexts and biases in existing datasets. Identifying current research gaps, we suggest future directions for resilient alignment strategies, advanced defenses against evolving attacks, automation of jailbreak detection, and consideration of ethical and societal impacts. This review emphasizes the need for continued research and cooperation within the AI community to enhance LLM security and ensure their safe deployment.

arXiv information

Authors	Benji Peng, Keyu Chen, Qian Niu, Ziqian Bi, Ming Liu, Pohsun Feng, Tianyang Wang, Lawrence K. Q. Yan, Yizhu Wen, Yichao Zhang, Caitlyn Heqi Yin
Published	2025-05-08 13:35:24+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
