DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments

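Summary

Large language model (LLM) agents have shown impressive capabilities in human language comprehension and reasoning, but their potential in cybersecurity remains underexplored.
We introduce DefenderBench, a practical, open-source toolkit for evaluating language agents across offense, defense, and cybersecurity knowledge-based tasks.
DefenderBench includes environments for network intrusion, malicious content detection, code vulnerability analysis, and cybersecurity knowledge assessment.
It is intentionally designed to be affordable and easily accessible for researchers while providing fair and rigorous assessment.
Using a standardized agentic framework, we benchmark several state-of-the-art (SoTA) and popular LLMs, including both open- and closed-weight models.
Our results show that Claude-3.7-sonnet performs best with a DefenderBench score of 81.65, followed by Claude-3.7-sonnet-think with 78.40; the best open-weight model, Llama 3.3 70B, scores 71.81.
DefenderBench's modular design allows seamless integration of custom LLMs and tasks, promoting reproducibility and fair comparisons.
An anonymized version of DefenderBench is available at https://github.com/microsoft/DefenderBench.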

Abstract (original)

Large language model (LLM) agents have shown impressive capabilities in human language comprehension and reasoning, yet their potential in cybersecurity remains underexplored. We introduce DefenderBench, a practical, open-source toolkit for evaluating language agents across offense, defense, and cybersecurity knowledge-based tasks. DefenderBench includes environments for network intrusion, malicious content detection, code vulnerability analysis, and cybersecurity knowledge assessment. It is intentionally designed to be affordable and easily accessible for researchers while providing fair and rigorous assessment. We benchmark several state-of-the-art (SoTA) and popular LLMs, including both open- and closed-weight models, using a standardized agentic framework. Our results show that Claude-3.7-sonnet performs best with a DefenderBench score of 81.65, followed by Claude-3.7-sonnet-think with 78.40, while the best open-weight model, Llama 3.3 70B, is not far behind with a DefenderBench score of 71.81. DefenderBench’s modular design allows seamless integration of custom LLMs and tasks, promoting reproducibility and fair comparisons. An anonymized version of DefenderBench is available at https://github.com/microsoft/DefenderBench.

arXiv information

Authors	Chiyu Zhang, Marc-Alexandre Cote, Michael Albada, Anush Sankaran, Jack W. Stokes, Tong Wang, Amir Abdi, William Blum, Muhammad Abdul-Mageed
Published	2025-06-10 17:00:37+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
