Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections

Summary

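LLM-as-a-judge systems, which are used to assess text quality, code correctness, and argument strength, are vulnerable to prompt injection attacks.
We introduce a framework that separates content-author attacks from system-prompt attacks, and evaluate five models (Gemma 3.27B, Gemma 3.4B, Llama 3.2 3B, GPT 4, and Claude 3 Opus) on four tasks with various defenses, using 50 prompts per condition.
Attacks achieved success rates of up to 73.8%; smaller models proved more vulnerable, and transferability ranged from 50.5% to 62.6%.
Our results contrast with Universal Prompt Injection and AdvPrompter. We recommend multi-model committees and comparative scoring, and we release all code and datasets.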

Abstract (original)

LLM as judge systems used to assess text quality, code correctness, and argument strength are vulnerable to prompt injection attacks. We introduce a framework that separates content author attacks from system prompt attacks and evaluate five models (Gemma 3.27B, Gemma 3.4B, Llama 3.2 3B, GPT 4, and Claude 3 Opus) on four tasks with various defenses, using fifty prompts per condition. Attacks achieved up to seventy three point eight percent success, smaller models proved more vulnerable, and transferability ranged from fifty point five to sixty two point six percent. Our results contrast with Universal Prompt Injection and AdvPrompter. We recommend multi model committees and comparative scoring, and release all code and datasets.
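The abstract recommends multi-model committees and reports attack success rates without spelling out either mechanism here. The following is only an illustrative sketch, not the authors' implementation: it assumes a committee aggregates per-judge scores with the median, and it assumes an attack "succeeds" when the injected version of a prompt scores at least `threshold` points higher than the clean version. The function names `committee_score` and `attack_success_rate`, the median rule, and the threshold are all hypothetical choices made for illustration.

```python
from statistics import median


def committee_score(judge_scores):
    """Aggregate scores from several judge models (assumed rule: median).

    The median is used here so that one manipulated judge cannot
    move the committee result far on its own.
    """
    if not judge_scores:
        raise ValueError("need at least one judge score")
    return median(judge_scores)


def attack_success_rate(clean_scores, attacked_scores, threshold=1.0):
    """Fraction of prompts whose score rose by at least `threshold`.

    Assumed success criterion, not taken from the paper: an attack counts
    as successful when attacked_score - clean_score >= threshold.
    """
    if len(clean_scores) != len(attacked_scores):
        raise ValueError("score lists must have the same length")
    if not clean_scores:
        raise ValueError("need at least one prompt")
    hits = sum(
        1 for clean, attacked in zip(clean_scores, attacked_scores)
        if attacked - clean >= threshold
    )
    return hits / len(clean_scores)


if __name__ == "__main__":
    # Two judges give 4 and 5; a third, hijacked judge gives 10.
    print(committee_score([4, 5, 10]))
    # Two of four prompts gain at least one point under attack.
    print(attack_success_rate([3, 4, 5, 6], [3, 7, 5, 9]))
```

Under these assumptions, a single inflated score shifts a mean-based aggregate but leaves the median close to the honest judges, which is one plausible reading of why a committee could help.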

arXiv information

Authors	Narek Maloyan, Dmitry Namiot
Published	2025-04-25 13:18:42+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
