SleeperNets: Universal Backdoor Poisoning Attacks Against Reinforcement Learning Agents

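Summary

Reinforcement learning (RL) is an actively growing field that is seeing increased use in real-world, safety-critical applications, which makes it paramount to ensure that RL algorithms are robust against adversarial attacks.
This work explores a particularly stealthy form of training-time attack against RL: backdoor poisoning.
Here, the adversary intercepts the training of an RL agent with the goal of reliably inducing a particular action when the agent observes a pre-determined trigger at inference time.
We uncover theoretical limitations of prior work by proving that it cannot generalize across domains and MDPs.
Motivated by this, we formulate a novel poisoning attack framework that interlinks the adversary's objectives with those of finding an optimal policy, guaranteeing attack success in the limit.
Using insights from our theoretical analysis, we develop "SleeperNets", a universal backdoor attack that exploits a newly proposed threat model and leverages dynamic reward poisoning techniques.
We evaluate the attack in 6 environments spanning multiple domains and demonstrate significant improvements in attack success over existing methods while preserving benign episodic return.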
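
To make the idea of backdoor poisoning concrete, the following is a minimal, generic sketch of how an adversary might tamper with training transitions. It is not the SleeperNets algorithm: the abstract does not specify the dynamic reward poisoning scheme, so this example swaps in a fixed +1/-1 reward purely for illustration. The names `Transition`, `apply_trigger`, `poison_batch`, `poison_rate`, and the single-feature trigger are all assumptions introduced here.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Transition:
    """One (observation, action, reward) training sample (hypothetical type)."""
    obs: Tuple[float, ...]
    action: int
    reward: float


def apply_trigger(obs: Tuple[float, ...],
                  trigger_index: int = 0,
                  trigger_value: float = 1.0) -> Tuple[float, ...]:
    """Stamp an assumed trigger pattern by overwriting a single feature."""
    stamped = list(obs)
    stamped[trigger_index] = trigger_value
    return tuple(stamped)


def poison_batch(batch: List[Transition],
                 target_action: int,
                 poison_rate: float,
                 rng: random.Random) -> List[Transition]:
    """Poison roughly a `poison_rate` fraction of the batch.

    A poisoned transition gets the trigger stamped on its observation and
    its reward overwritten: +1 if the agent took the target action, -1
    otherwise. This static reward is an illustrative stand-in, not the
    paper's dynamic reward poisoning.
    """
    poisoned = []
    for t in batch:
        if rng.random() < poison_rate:
            reward = 1.0 if t.action == target_action else -1.0
            poisoned.append(Transition(apply_trigger(t.obs), t.action, reward))
        else:
            poisoned.append(t)  # benign transitions pass through unchanged
    return poisoned


rng = random.Random(0)
batch = [Transition((0.0, 0.5), a % 2, 0.1) for a in range(10)]
all_poisoned = poison_batch(batch, target_action=1, poison_rate=1.0, rng=rng)
none_poisoned = poison_batch(batch, target_action=1, poison_rate=0.0, rng=rng)
```

With `poison_rate=1.0` every transition carries the trigger and a reward that favors the target action; with `poison_rate=0.0` the batch is returned unchanged. A realistic attack would use a small rate so that benign episodic return is preserved.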

Summary (original)

Reinforcement learning (RL) is an actively growing field that is seeing increased usage in real-world, safety-critical applications — making it paramount to ensure the robustness of RL algorithms against adversarial attacks. In this work we explore a particularly stealthy form of training-time attacks against RL — backdoor poisoning. Here the adversary intercepts the training of an RL agent with the goal of reliably inducing a particular action when the agent observes a pre-determined trigger at inference time. We uncover theoretical limitations of prior work by proving their inability to generalize across domains and MDPs. Motivated by this, we formulate a novel poisoning attack framework which interlinks the adversary’s objectives with those of finding an optimal policy — guaranteeing attack success in the limit. Using insights from our theoretical analysis we develop “SleeperNets” as a universal backdoor attack which exploits a newly proposed threat model and leverages dynamic reward poisoning techniques. We evaluate our attack in 6 environments spanning multiple domains and demonstrate significant improvements in attack success over existing methods, while preserving benign episodic return.

arXiv information

Authors	Ethan Rathbun, Christopher Amato, Alina Oprea
Published	2024-10-21 16:44:58+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
