Beyond Natural Language Perplexity: Detecting Dead Code Poisoning in Code Generation Datasets

Summary

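The growing use of large language models (LLMs) for code-related tasks has raised concerns about the security of their training datasets.
One critical threat is dead code poisoning, in which syntactically valid but functionally redundant code is injected into training data to manipulate model behavior.
Such attacks can degrade the performance of neural code search systems and lead to biased or insecure code suggestions.
Existing detection methods such as token-level perplexity analysis fail to identify dead code effectively because of the structural and contextual characteristics of programming languages.
This paper proposes DePA (Dead Code Perplexity Analysis), a new line-level detection and cleansing method tailored to the structural properties of code.
DePA computes line-level perplexity by leveraging the contextual relationships between code lines, and identifies anomalous lines by comparing each line's perplexity with the overall distribution within the file.
Experiments on benchmark datasets show that DePA significantly outperforms existing methods, improving detection F1-score by 0.14-0.19 and poisoned-segment localization precision by 44-65%.
DePA also improves detection speed by 0.62-23x, making it practical for large-scale dataset cleansing.
Overall, by addressing the unique challenges of dead code poisoning, DePA offers a robust and efficient way to safeguard the integrity of training datasets for code generation models.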
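
The line-level idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mean-plus-k-standard-deviations rule, and the value of k are assumptions made for illustration, and the per-token log-probabilities are hypothetical numbers standing in for the output of whatever code language model does the scoring.

```python
import math
import statistics


def line_perplexity(token_logprobs):
    """Perplexity of one line from its per-token natural-log probabilities."""
    if not token_logprobs:
        return None  # blank line: nothing to score
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def flag_anomalous_lines(per_line_logprobs, k=1.5):
    """Return indices of lines whose perplexity exceeds mean + k * stdev
    of the perplexities in the same file (an assumed outlier rule)."""
    ppl = [line_perplexity(lp) for lp in per_line_logprobs]
    scored = [p for p in ppl if p is not None]
    if len(scored) < 2:
        return []  # not enough lines to form a distribution
    threshold = statistics.mean(scored) + k * statistics.pstdev(scored)
    return [i for i, p in enumerate(ppl) if p is not None and p > threshold]


# Hypothetical log-probabilities for a five-line file.
# Index 2 is a blank line; index 3 stands in for an injected dead-code line.
example = [
    [-0.2, -0.3, -0.1],
    [-0.4, -0.2],
    [],
    [-3.0, -2.5, -3.5],
    [-0.3, -0.3, -0.2, -0.1],
]
suspicious = flag_anomalous_lines(example)
print(suspicious)
```

In this toy input only the line at index 3 is far above the rest of the file, so it is the one flagged. For very short files a fixed k behaves poorly, since a single outlier among n lines can be at most sqrt(n - 1) population standard deviations above the mean; a median-based rule may be a sturdier choice there.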

Abstract (original)

The increasing adoption of large language models (LLMs) for code-related tasks has raised concerns about the security of their training datasets. One critical threat is dead code poisoning, where syntactically valid but functionally redundant code is injected into training data to manipulate model behavior. Such attacks can degrade the performance of neural code search systems, leading to biased or insecure code suggestions. Existing detection methods, such as token-level perplexity analysis, fail to effectively identify dead code due to the structural and contextual characteristics of programming languages. In this paper, we propose DePA (Dead Code Perplexity Analysis), a novel line-level detection and cleansing method tailored to the structural properties of code. DePA computes line-level perplexity by leveraging the contextual relationships between code lines and identifies anomalous lines by comparing their perplexity to the overall distribution within the file. Our experiments on benchmark datasets demonstrate that DePA significantly outperforms existing methods, achieving 0.14-0.19 improvement in detection F1-score and a 44-65% increase in poisoned segment localization precision. Furthermore, DePA enhances detection speed by 0.62-23x, making it practical for large-scale dataset cleansing. Overall, by addressing the unique challenges of dead code poisoning, DePA provides a robust and efficient solution for safeguarding the integrity of code generation model training datasets.
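
For reference, the standard definition of perplexity can be restricted to the tokens of a single line. The notation here is an assumption for illustration and is not taken from the paper: line \ell_i consists of tokens t_{i,1}, ..., t_{i,n_i}, and c_i denotes whatever surrounding context the scoring model conditions on.

```latex
\mathrm{PPL}(\ell_i)
  = \exp\!\left( -\frac{1}{n_i} \sum_{j=1}^{n_i}
      \log p\left( t_{i,j} \mid c_i,\, t_{i,1}, \dots, t_{i,j-1} \right) \right)
```

A line is then a candidate for removal when \mathrm{PPL}(\ell_i) is unusually high relative to the perplexities of the other lines in the same file; the exact comparison rule is not specified in the abstract.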

arXiv information

Authors	Chichien Tsai, Chiamu Yu, Yingdar Lin, Yusung Wu, Weibin Lee
Published	2025-02-27 16:30:00+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
