On Trojan Signatures in Large Language Models of Code

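Summary

Trojan signatures, as described by Fields et al. (2021), are noticeable differences between the distribution of the trojaned-class parameters (weights) and the distribution of the non-trojaned-class parameters of a trojaned model, and they can be used to detect that model. Fields et al. (2021) found trojan signatures in computer vision classification tasks with image models such as Resnet, WideResnet, Densenet, and VGG. This paper investigates such signatures in the classifier layer parameters of large language models of source code. The results suggest that trojan signatures do not generalize to LLMs of code. Trojaned code models proved stubborn, in that no signature emerged even when the models were poisoned under more explicit settings (fine-tuned with the pre-trained weights frozen). The authors analyzed nine trojaned models for two binary classification tasks: clone detection and defect detection. To the best of their knowledge, this is the first work to examine weight-based trojan signature revelation techniques for large language models of code, and also the first to demonstrate that detecting trojans from the weights alone is a hard problem in such models.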

Abstract (original)

Trojan signatures, as described by Fields et al. (2021), are noticeable differences in the distribution of the trojaned class parameters (weights) and the non-trojaned class parameters of the trojaned model, that can be used to detect the trojaned model. Fields et al. (2021) found trojan signatures in computer vision classification tasks with image models, such as, Resnet, WideResnet, Densenet, and VGG. In this paper, we investigate such signatures in the classifier layer parameters of large language models of source code. Our results suggest that trojan signatures could not generalize to LLMs of code. We found that trojaned code models are stubborn, even when the models were poisoned under more explicit settings (finetuned with pre-trained weights frozen). We analyzed nine trojaned models for two binary classification tasks: clone and defect detection. To the best of our knowledge, this is the first work to examine weight-based trojan signature revelation techniques for large-language models of code and furthermore to demonstrate that detecting trojans only from the weights in such models is a hard problem.
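To make the idea of a weight-based signature concrete, here is a minimal sketch of comparing one class's classifier-layer weights against the rest. It is an illustration under assumptions, not the procedure of Fields et al. (2021) or of this paper: the function name `class_weight_shift`, the standardized mean gap it computes, the row length of 768, and the synthetic weights are all hypothetical choices made for the example.

```python
import random
import statistics


def class_weight_shift(classifier_weights, target_class):
    """Standardized gap between the mean weight of one class row and all other rows.

    classifier_weights: list of per-class weight rows (one list of floats per class).
    This statistic is an illustrative stand-in for a "signature", not a published test.
    """
    target = classifier_weights[target_class]
    others = [
        w
        for i, row in enumerate(classifier_weights)
        if i != target_class
        for w in row
    ]
    pooled_std = statistics.pstdev(target + others)
    if pooled_std == 0:
        return 0.0
    return (statistics.fmean(target) - statistics.fmean(others)) / pooled_std


# Synthetic toy weights for a binary classifier head (assumed shape: 2 x 768).
rng = random.Random(0)
clean = [[rng.gauss(0.0, 0.02) for _ in range(768)] for _ in range(2)]

# Hypothetical "signature": every weight of class 1 shifted upward by a constant.
shifted = [clean[0], [w + 0.05 for w in clean[1]]]

clean_gap = class_weight_shift(clean, target_class=1)
shifted_gap = class_weight_shift(shifted, target_class=1)
print(f"clean gap: {clean_gap:.3f}, shifted gap: {shifted_gap:.3f}")
```

On this toy data the shifted row stands out clearly from the other row, which is the kind of visible difference a signature would provide. The abstract's finding is that trojaned code models did not reveal such a difference in their classifier layer, so a check of this kind would not be expected to separate them from clean models.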

arXiv information

Authors	Aftab Hussain, Md Rafiqul Islam Rabin, Mohammad Amin Alipour
Published	2024-03-07 15:59:17+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
