How adversarial attacks can disrupt seemingly stable accurate classifiers

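Summary

Adversarial attacks use a seemingly trivial change to a piece of input data to dramatically alter the output of an otherwise accurate learning system.
Paradoxically, empirical evidence shows that even systems that are robust to large random perturbations of their input data remain susceptible to small, easily constructed adversarial perturbations of their inputs.
Here, we show that this may be regarded as a fundamental feature of classifiers that process high-dimensional input data.
We introduce a simple, generic, and generalisable framework in which the key behaviours observed in practical systems arise with high probability: in particular, the simultaneous susceptibility of the (otherwise accurate) model to easily constructed adversarial attacks and its robustness to random perturbations of the input data.
We confirm that the same phenomena are directly observed in practical neural networks trained on standard image classification problems, where even large additive random noise fails to trigger the network's adversarial instability.
A surprising takeaway is that even a small margin separating a classifier's decision surface from the training and test data can hide adversarial susceptibility from detection by randomly sampled perturbations.
Counterintuitively, using additive noise during training or testing is therefore inefficient for eradicating or detecting adversarial examples, and more demanding adversarial training is required.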

Summary (original)

Adversarial attacks dramatically change the output of an otherwise accurate learning system using a seemingly inconsequential modification to a piece of input data. Paradoxically, empirical evidence indicates that even systems which are robust to large random perturbations of the input data remain susceptible to small, easily constructed, adversarial perturbations of their inputs. Here, we show that this may be seen as a fundamental feature of classifiers working with high dimensional input data. We introduce a simple generic and generalisable framework for which key behaviours observed in practical systems arise with high probability — notably the simultaneous susceptibility of the (otherwise accurate) model to easily constructed adversarial attacks, and robustness to random perturbations of the input data. We confirm that the same phenomena are directly observed in practical neural networks trained on standard image classification problems, where even large additive random noise fails to trigger the adversarial instability of the network. A surprising takeaway is that even small margins separating a classifier’s decision surface from training and testing data can hide adversarial susceptibility from being detected using randomly sampled perturbations. Counterintuitively, using additive noise during training or testing is therefore inefficient for eradicating or detecting adversarial examples, and more demanding adversarial training is required.
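
The contrast between random and adversarial perturbations can be illustrated with a toy example. The sketch below is not the authors' framework or code: it assumes a single linear decision surface through the origin, a data point placed at a small margin from it, and random perturbations drawn uniformly from a sphere. The function name flip_rates and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def flip_rates(dim=2000, margin=0.2, noise_norm=1.0, trials=1000, seed=0):
    """Toy illustration (assumed setup, not the paper's framework).

    A linear classifier predicts sign(w . x). The point x sits at distance
    `margin` from the decision surface. We compare random perturbations of
    norm `noise_norm` with one adversarial step of norm 1.5 * margin.
    """
    rng = np.random.default_rng(seed)

    # Unit normal of the decision hyperplane (assumption: first axis).
    w = np.zeros(dim)
    w[0] = 1.0

    # Correctly classified point, a small margin away from the surface.
    x = margin * w

    # Random perturbations, uniform on the sphere of radius noise_norm.
    noise = rng.standard_normal((trials, dim))
    noise *= noise_norm / np.linalg.norm(noise, axis=1, keepdims=True)
    random_flip_rate = float(np.mean((x + noise) @ w < 0))

    # Adversarial perturbation: step straight across the decision surface.
    adversarial = -1.5 * margin * w
    adversarial_flips = bool((x + adversarial) @ w < 0)

    return random_flip_rate, adversarial_flips

if __name__ == "__main__":
    for d in (2, 2000):
        rate, adv = flip_rates(dim=d)
        print(f"dim={d}: random flip rate={rate:.3f}, adversarial flips={adv}")
```

In this assumed setup the adversarial step has norm 0.3 and always changes the label, while random noise of norm 1.0 changes it often in 2 dimensions and almost never in 2000. The reason is that the component of a random direction along the normal shrinks roughly like 1/sqrt(dim), so in high dimensions it rarely exceeds the margin.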

arXiv information

Authors	Oliver J. Sutton, Qinghua Zhou, Ivan Y. Tyukin, Alexander N. Gorban, Alexander Bastounis, Desmond J. Higham
Published	2023-09-07 12:02:00+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
