Less is more? Rewards in RL for Cyber Defence

Summary

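The past few years have seen an explosion of interest in autonomous cyber defence agents based on deep reinforcement learning.
Such agents are typically trained in cyber gym environments, also known as cyber simulators, of which at least 32 have already been built.
Most, if not all, cyber gyms provide dense "scaffolded" reward functions that combine many penalties or incentives for a range of (un)desirable states and costly actions.
Dense rewards help alleviate the challenge of exploring complex environments, yielding seemingly effective strategies from relatively few environment steps.
However, they are also known to bias the solutions an agent can find, potentially towards suboptimal ones.
This is especially problematic in complex cyber environments, where policy weaknesses may go unnoticed until an adversary exploits them.
In this work, we set out to evaluate whether sparse reward functions can train more effective cyber defence agents.
To this end, we first address several evaluation limitations in existing work by proposing a ground-truth evaluation score that goes beyond the standard RL paradigm used to train and evaluate agents.
By adapting a well-established cyber gym to accommodate our methodology and ground-truth score, we propose and evaluate two sparse reward mechanisms and compare them with a typical dense reward.
Our evaluation covers network sizes from 2 to 50 nodes and both reactive and proactive defensive actions.
Our results show that sparse rewards, particularly positive reinforcement for an uncompromised network state, enable the training of more effective cyber defence agents.
Furthermore, we show that sparse rewards provide more stable training than dense rewards, and that both effectiveness and training stability are robust to a variety of cyber environment considerations.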
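
To make the contrast between the two reward styles concrete, here is a minimal Python sketch. It is an illustration only, not the authors' implementation: the function names, the penalty weights, and the representation of network state as a list of per-node compromise flags are all assumptions, and only the sparse mechanism named in the abstract (positive reinforcement for an uncompromised network state) is shown.

```python
from typing import Sequence

# Hypothetical weights chosen for illustration; the abstract does not give values.
COMPROMISED_NODE_PENALTY = -1.0
ACTION_COST = -0.1


def dense_reward(compromised: Sequence[bool], action_is_costly: bool) -> float:
    """Scaffolded reward: a penalty per compromised node plus a cost for costly actions."""
    reward = COMPROMISED_NODE_PENALTY * sum(compromised)
    if action_is_costly:
        reward += ACTION_COST
    return reward


def sparse_positive_reward(compromised: Sequence[bool]) -> float:
    """Sparse reward: +1 only when no node in the network is compromised, else 0."""
    return 0.0 if any(compromised) else 1.0
```

The dense variant produces a graded signal at every step, while the sparse variant only tells the agent whether the whole network is currently clean, leaving it to discover how to reach that state.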

Summary (original)

The last few years have seen an explosion of interest in autonomous cyber defence agents based on deep reinforcement learning. Such agents are typically trained in a cyber gym environment, also known as a cyber simulator, at least 32 of which have already been built. Most, if not all cyber gyms provide dense ‘scaffolded’ reward functions which combine many penalties or incentives for a range of (un)desirable states and costly actions. Whilst dense rewards help alleviate the challenge of exploring complex environments, yielding seemingly effective strategies from relatively few environment steps; they are also known to bias the solutions an agent can find, potentially towards suboptimal solutions. This is especially a problem in complex cyber environments where policy weaknesses may not be noticed until exploited by an adversary. In this work we set out to evaluate whether sparse reward functions might enable training more effective cyber defence agents. Towards this goal we first break down several evaluation limitations in existing work by proposing a ground truth evaluation score that goes beyond the standard RL paradigm used to train and evaluate agents. By adapting a well-established cyber gym to accommodate our methodology and ground truth score, we propose and evaluate two sparse reward mechanisms and compare them with a typical dense reward. Our evaluation considers a range of network sizes, from 2 to 50 nodes, and both reactive and proactive defensive actions. Our results show that sparse rewards, particularly positive reinforcement for an uncompromised network state, enable the training of more effective cyber defence agents. Furthermore, we show that sparse rewards provide more stable training than dense rewards, and that both effectiveness and training stability are robust to a variety of cyber environment considerations.

arXiv information

Authors	Elizabeth Bates, Chris Hicks, Vasilios Mavroudis
Published	2025-03-10 15:51:39+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
