ThreatCrawl: A BERT-based Focused Crawler for the Cybersecurity Domain

Summary

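Publicly available information contains valuable material for Cyber Threat Intelligence (CTI).
It can be used to prevent attacks that have already been carried out against other systems.
Ideally, only the first attack succeeds, and every subsequent attack is detected and stopped.
Although several standards exist for exchanging this information, much of it is shared in non-standardized ways in articles and blog posts.
Manually scanning multiple online portals and news pages to discover new threats and extract them is time-consuming.
To automate parts of this scanning process, several papers propose extractors that use Natural Language Processing (NLP) to extract Indicators of Compromise (IOCs) from documents.
While this addresses the problem of extracting information from documents, the search for those documents is rarely considered.
This paper proposes a new focused crawler called ThreatCrawl, which uses models based on Bidirectional Encoder Representations from Transformers (BERT) to classify documents and dynamically adapt its crawling path.
ThreatCrawl has difficulty classifying the specific type of Open Source Intelligence (OSINT) named in a text, such as IOC content, but it successfully finds relevant documents and adjusts its path accordingly.
It achieves harvest rates of up to 52%, which, to the best of the authors' knowledge, is better than the current state of the art.
The results and source code will be made publicly available upon acceptance.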
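
The idea of a focused crawler that scores each page and steers its frontier accordingly can be illustrated with a small sketch. This is not ThreatCrawl's implementation: the in-memory `TOY_WEB` graph, the `keyword_score` function (a simple stand-in for a BERT-based classifier), the `threshold`, and the `budget` are all hypothetical choices made for illustration. The harvest rate is computed here in its usual sense, as the share of crawled pages that were judged relevant.

```python
import heapq
from typing import Callable, Dict, List, Tuple

# Hypothetical toy web: URL -> (page text, outgoing links).
TOY_WEB: Dict[str, Tuple[str, List[str]]] = {
    "seed": ("security news portal", ["a", "b"]),
    "a": ("malware campaign uses new exploit", ["c"]),
    "b": ("recipe for apple pie", ["d"]),
    "c": ("ransomware exploit indicators published", []),
    "d": ("gardening tips for spring", []),
}


def keyword_score(text: str) -> float:
    """Stand-in for a BERT-based classifier (assumption, not the paper's model).

    Returns the fraction of words that are threat-related terms.
    """
    terms = {"malware", "exploit", "ransomware", "indicators"}
    words = text.lower().split()
    return sum(w in terms for w in words) / len(words) if words else 0.0


def focused_crawl(
    seed: str,
    web: Dict[str, Tuple[str, List[str]]],
    score: Callable[[str], float],
    threshold: float = 0.2,
    budget: int = 4,
) -> Tuple[List[str], List[str], float]:
    """Crawl best-first: links found on higher-scoring pages are visited earlier."""
    frontier: List[Tuple[float, str]] = [(-1.0, seed)]  # min-heap on negated priority
    seen = {seed}
    visited: List[str] = []
    relevant: List[str] = []

    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        page_score = score(text)
        visited.append(url)
        if page_score >= threshold:
            relevant.append(url)
        # Adapt the crawl path: outgoing links inherit the parent page's score.
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-page_score, link))

    harvest_rate = len(relevant) / len(visited) if visited else 0.0
    return visited, relevant, harvest_rate


if __name__ == "__main__":
    visited, relevant, rate = focused_crawl("seed", TOY_WEB, keyword_score)
    print(visited, relevant, f"harvest rate = {rate:.0%}")
```

In a real setting, `keyword_score` would be replaced by a trained document classifier and `TOY_WEB` by actual HTTP fetching and link extraction; the priority queue logic stays the same.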

Summary (original)

Publicly available information contains valuable information for Cyber Threat Intelligence (CTI). This can be used to prevent attacks that have already taken place on other systems. Ideally, only the initial attack succeeds and all subsequent ones are detected and stopped. But while there are different standards to exchange this information, a lot of it is shared in articles or blog posts in non-standardized ways. Manually scanning through multiple online portals and news pages to discover new threats and extracting them is a time-consuming task. To automize parts of this scanning process, multiple papers propose extractors that use Natural Language Processing (NLP) to extract Indicators of Compromise (IOCs) from documents. However, while this already solves the problem of extracting the information out of documents, the search for these documents is rarely considered. In this paper, a new focused crawler is proposed called ThreatCrawl, which uses Bidirectional Encoder Representations from Transformers (BERT)-based models to classify documents and adapt its crawling path dynamically. While ThreatCrawl has difficulties to classify the specific type of Open Source Intelligence (OSINT) named in texts, e.g., IOC content, it can successfully find relevant documents and modify its path accordingly. It yields harvest rates of up to 52%, which are, to the best of our knowledge, better than the current state of the art. The results and source code will be made publicly available upon acceptance.

arXiv information

Authors	Philipp Kuehn, Mike Schmidt, Markus Bayer, Christian Reuter
Published	2025-03-24 09:14:21+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
