CyberMetric: A Benchmark Dataset for Evaluating Large Language Models Knowledge in Cybersecurity

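Summary

Large language models (LLMs) perform well across a range of domains, from computer vision to medical diagnostics.
However, understanding the diverse landscape of cybersecurity, which spans cryptography, reverse engineering, and managerial aspects such as risk assessment, is challenging even for human experts.
This paper introduces CyberMetric, a benchmark dataset of 10,000 questions drawn from standards, certifications, research papers, books, and other publications in the cybersecurity domain.
The questions are created through a collaborative process that combines expert knowledge with LLMs, including GPT-3.5 and Falcon-180B.
Human experts spent over 200 hours verifying the questions' accuracy and relevance.
Beyond assessing LLMs' knowledge, the dataset's main goal is to enable a fair comparison between humans and different LLMs in cybersecurity.
To that end, the authors carefully selected 80 questions covering a wide range of cybersecurity topics and recruited 30 participants with varying levels of expertise, allowing a comprehensive comparison of human and machine intelligence in this area.
The results show that LLMs outperformed humans in almost every aspect of cybersecurity.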

Summary (original)

Large Language Models (LLMs) excel across various domains, from computer vision to medical diagnostics. However, understanding the diverse landscape of cybersecurity, encompassing cryptography, reverse engineering, and managerial facets like risk assessment, presents a challenge, even for human experts. In this paper, we introduce CyberMetric, a benchmark dataset comprising 10,000 questions sourced from standards, certifications, research papers, books, and other publications in the cybersecurity domain. The questions are created through a collaborative process, i.e., merging expert knowledge with LLMs, including GPT-3.5 and Falcon-180B. Human experts spent over 200 hours verifying their accuracy and relevance. Beyond assessing LLMs’ knowledge, the dataset’s main goal is to facilitate a fair comparison between humans and different LLMs in cybersecurity. To achieve this, we carefully selected 80 questions covering a wide range of topics within cybersecurity and involved 30 participants of diverse expertise levels, facilitating a comprehensive comparison between human and machine intelligence in this area. The findings revealed that LLMs outperformed humans in almost every aspect of cybersecurity.

arXiv information

Authors	Norbert Tihanyi, Mohamed Amine Ferrag, Ridhi Jain, Merouane Debbah
Published	2024-02-12 14:53:28+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
