Robust Feature-Level Adversaries are Interpretability Tools

Summary

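The literature on adversarial attacks in computer vision typically focuses on pixel-level perturbations.
These tend to be very difficult to interpret.
Recent work that manipulates the latent representations of image generators to create "feature-level" adversarial perturbations offers an opportunity to explore perceptible, interpretable adversarial attacks.
We make three contributions.
First, we observe that feature-level attacks provide useful classes of inputs for studying the representations inside models.
Second, we show that these adversaries are uniquely versatile and highly robust.
We demonstrate that they can be used to produce targeted, universal, disguised, physically realizable, and black-box attacks at ImageNet scale.
Third, we show how these adversarial images can be used as a practical interpretability tool for identifying bugs in networks.
We use these adversaries to make predictions about spurious associations between features and classes, and then test those predictions by designing "copy/paste" attacks in which one natural image is pasted into another to cause a targeted misclassification.
Our results suggest that feature-level attacks are a promising approach for rigorous interpretability research.
They support the design of tools for better understanding what a model has learned and for diagnosing brittle feature associations.
Code is available at https://github.com/thestephencasper/feature_level_adv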
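
The "copy/paste" test mentioned in the summary can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that). The function names `paste_patch` and `is_targeted_success`, the toy classifier, and the image sizes are all assumptions made purely for illustration, and a real experiment would use a trained ImageNet classifier in place of the stand-in.

```python
import numpy as np


def paste_patch(background, patch, top, left):
    """Return a copy of `background` with `patch` pasted at (top, left).

    Both arrays are H x W (grayscale) or H x W x C images. The patch must
    fit entirely inside the background.
    """
    h, w = patch.shape[:2]
    big_h, big_w = background.shape[:2]
    if top < 0 or left < 0 or top + h > big_h or left + w > big_w:
        raise ValueError("patch does not fit inside the background")
    out = background.copy()  # leave the original image untouched
    out[top:top + h, left:left + w] = patch
    return out


def is_targeted_success(classify, image, target_class):
    """True if the classifier assigns `image` to the attacker's target class."""
    return classify(image) == target_class


def toy_classifier(image):
    """Hypothetical stand-in for a real network.

    Predicts class 1 when more than 5% of the pixels are bright, else class 0.
    It only exists so that the sketch runs end to end.
    """
    bright_fraction = float(np.mean(image > 0.9))
    return 1 if bright_fraction > 0.05 else 0


# A dark "source" image and a bright "natural" patch (both synthetic here).
background = np.zeros((32, 32), dtype=np.float32)
patch = np.ones((8, 8), dtype=np.float32)

attacked = paste_patch(background, patch, top=4, left=4)
print("before:", toy_classifier(background))
print("after: ", toy_classifier(attacked))
print("targeted success:", is_targeted_success(toy_classifier, attacked, 1))
```

In this toy setup the pasted patch plays the role of the natural feature that the model spuriously associates with the target class. Checking the prediction before and after pasting is the core of the test.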

Abstract (original)

The literature on adversarial attacks in computer vision typically focuses on pixel-level perturbations. These tend to be very difficult to interpret. Recent work that manipulates the latent representations of image generators to create ‘feature-level’ adversarial perturbations gives us an opportunity to explore perceptible, interpretable adversarial attacks. We make three contributions. First, we observe that feature-level attacks provide useful classes of inputs for studying representations in models. Second, we show that these adversaries are uniquely versatile and highly robust. We demonstrate that they can be used to produce targeted, universal, disguised, physically-realizable, and black-box attacks at the ImageNet scale. Third, we show how these adversarial images can be used as a practical interpretability tool for identifying bugs in networks. We use these adversaries to make predictions about spurious associations between features and classes which we then test by designing ‘copy/paste’ attacks in which one natural image is pasted into another to cause a targeted misclassification. Our results suggest that feature-level attacks are a promising approach for rigorous interpretability research. They support the design of tools to better understand what a model has learned and diagnose brittle feature associations. Code is available at https://github.com/thestephencasper/feature_level_adv

arXiv information

Authors	Stephen Casper, Max Nadeau, Dylan Hadfield-Menell, Gabriel Kreiman
Published	2023-09-11 16:31:55+00:00
arXiv page	arxiv_id (pdf)

Sources and services used

arxiv.jp, Google
