Evaluate-and-Purify: Fortifying Code Language Models Against Adversarial Attacks Using LLM-as-a-Judge

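Summary

The widespread adoption of code language models in software engineering tasks has exposed their vulnerability to adversarial attacks, in particular identifier substitution attacks.
Existing identifier substitution attackers achieve high success rates, but they often produce adversarial examples with unnatural code patterns.
In this paper, we systematically assess the quality of adversarial examples using LLM-as-a-Judge.
Our analysis shows that over 80% of the adversarial examples generated by state-of-the-art identifier substitution attackers (e.g., ALERT) are in fact detectable.
Building on this insight, we propose EP-Shield, a unified framework that evaluates and purifies identifier substitution attacks via naturalness-aware reasoning.
Specifically, we first evaluate the naturalness of the code to identify perturbed adversarial code, then purify it so that the victim model can restore the correct prediction.
Extensive experiments show that EP-Shield outperforms adversarial fine-tuning (by up to 83.36%) and that its lightweight design (7B parameters) delivers GPT-4-level performance.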

Abstract (original)

The widespread adoption of code language models in software engineering tasks has exposed vulnerabilities to adversarial attacks, especially identifier substitution attacks. Although existing identifier substitution attackers demonstrate high success rates, they often produce adversarial examples with unnatural code patterns. In this paper, we systematically assess the quality of adversarial examples using LLM-as-a-Judge. Our analysis reveals that over 80% of adversarial examples generated by state-of-the-art identifier substitution attackers (e.g., ALERT) are actually detectable. Based on this insight, we propose EP-Shield, a unified framework for evaluating and purifying identifier substitution attacks via naturalness-aware reasoning. Specifically, we first evaluate the naturalness of code and identify the perturbed adversarial code, then purify it so that the victim model can restore correct prediction. Extensive experiments demonstrate the superiority of EP-Shield over adversarial fine-tuning (up to 83.36% improvement) and its lightweight design (7B parameters) with GPT-4-level performance.
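
The evaluate-then-purify idea described in the abstract can be illustrated with a toy sketch. This is not EP-Shield itself: the abstract says an LLM judges naturalness, whereas the `toy_judge` heuristic below (a vowel-ratio score) is only a hypothetical stand-in so the example runs offline. All function names, the threshold, and the `var_N` renaming scheme are assumptions made for illustration.

```python
import ast
from typing import Callable, List


def toy_judge(name: str) -> float:
    """Hypothetical stand-in for an LLM judge: naturalness score in [0, 1]."""
    if len(name) <= 2:  # short names such as i, x, v are treated as natural
        return 1.0
    vowels = sum(ch in "aeiou" for ch in name.lower())
    return min(1.0, 3.0 * vowels / len(name))


def local_identifiers(source: str) -> List[str]:
    """Collect assigned variable names and function arguments, sorted."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            found.add(node.id)
        elif isinstance(node, ast.arg):
            found.add(node.arg)
    return sorted(found)


def evaluate(source: str,
             judge: Callable[[str], float] = toy_judge,
             threshold: float = 0.5) -> List[str]:
    """Evaluation step: return identifiers the judge scores as unnatural."""
    return [n for n in local_identifiers(source) if judge(n) < threshold]


class _Renamer(ast.NodeTransformer):
    """Rename variables and arguments according to a fixed mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        self.generic_visit(node)
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def purify(source: str, suspicious: List[str]) -> str:
    """Purification step: replace flagged identifiers with neutral names."""
    mapping = {name: f"var_{i}" for i, name in enumerate(suspicious)}
    tree = _Renamer(mapping).visit(ast.parse(source))
    return ast.unparse(tree)  # requires Python 3.9+


# A snippet whose accumulator was replaced by an unnatural identifier.
ADVERSARIAL = '''
def total(values):
    xqzvbn = 0
    for v in values:
        xqzvbn += v
    return xqzvbn
'''

suspicious = evaluate(ADVERSARIAL)
purified = purify(ADVERSARIAL, suspicious)
print(suspicious)  # ['xqzvbn']
print(purified)
```

The judge is passed in as a callable so that a real model-backed scorer could replace the heuristic without touching the rest of the pipeline. The renamer is deliberately simple: it does not handle keyword arguments at call sites or name collisions with existing `var_N` identifiers.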

arXiv information

Authors	Wenhan Mu, Ling Xu, Shuren Pei, Le Mi, Huichi Zhou
Published	2025-04-28 12:28:55+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
