Attacking Large Language Models with Projected Gradient Descent

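Abstract

Current LLM alignment methods are easily broken by specially crafted adversarial prompts. Crafting adversarial prompts with discrete optimization is highly effective, but such attacks typically use more than 100,000 LLM calls. This high computational cost makes them unsuitable for quantitative analysis, adversarial training, and similar uses. To remedy this, we revisit projected gradient descent (PGD) on the continuously relaxed input prompt. Although ordinary gradient-based attacks have mostly failed, we show that carefully controlling the error introduced by the continuous relaxation dramatically improves their effectiveness. Our PGD for LLMs is up to one order of magnitude faster than state-of-the-art discrete optimization at achieving the same devastating attack results.

Abstract (original)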

Current LLM alignment methods are readily broken through specifically crafted adversarial prompts. While crafting adversarial prompts using discrete optimization is highly effective, such attacks typically use more than 100,000 LLM calls. This high computational cost makes them unsuitable for, e.g., quantitative analyses and adversarial training. To remedy this, we revisit Projected Gradient Descent (PGD) on the continuously relaxed input prompt. Although previous attempts with ordinary gradient-based attacks largely failed, we show that carefully controlling the error introduced by the continuous relaxation tremendously boosts their efficacy. Our PGD for LLMs is up to one order of magnitude faster than state-of-the-art discrete optimization to achieve the same devastating attack results.
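
To make the idea of PGD on a continuously relaxed prompt concrete, here is a minimal toy sketch. It is not the authors' implementation. It assumes the relaxation replaces each one-hot token with a row on the probability simplex (the abstract does not spell out the relaxation), uses a linear toy loss in place of a real LLM loss, and omits the error control the abstract credits for the method's efficacy. The names `project_simplex`, `pgd_relaxed`, and `discretize` are hypothetical.

```python
def project_simplex(v):
    """Euclidean projection of v onto {x : x_i >= 0, sum(x) == 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # keep the threshold from the largest valid k
    return [max(x - theta, 0.0) for x in v]


def pgd_relaxed(grad_fn, seq_len, vocab_size, steps=50, lr=0.5):
    """Projected gradient descent over relaxed (soft) token choices."""
    # Start every position at the uniform distribution over the vocabulary.
    x = [[1.0 / vocab_size] * vocab_size for _ in range(seq_len)]
    for _ in range(steps):
        g = grad_fn(x)
        # Gradient step, then project each position back onto the simplex.
        x = [
            project_simplex([xi - lr * gi for xi, gi in zip(row, g_row)])
            for row, g_row in zip(x, g)
        ]
    return x


def discretize(x):
    """Map each relaxed row back to a discrete token id (argmax)."""
    return [max(range(len(row)), key=row.__getitem__) for row in x]


# Toy stand-in for an LLM loss: loss(x) = sum(cost[p][t] * x[p][t]),
# so the gradient is simply the (assumed) cost table itself.
cost = [[0.9, 0.1, 0.5],
        [0.2, 0.8, 0.3]]
relaxed = pgd_relaxed(lambda x: cost, seq_len=2, vocab_size=3)
tokens = discretize(relaxed)
print(tokens)
```

In a real attack the gradient would come from backpropagating the model's loss through the relaxed prompt, and the gap between the relaxed rows and the discrete tokens recovered by `discretize` is the relaxation error the abstract refers to.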

arXiv information

Authors	Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Johannes Gasteiger, Stephan Günnemann
Published	2025-03-03 09:37:27+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
