Can LLMs Obfuscate Code? A Systematic Analysis of Large Language Models into Assembly Code Obfuscation

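Summary

Malware authors often use code obfuscation to make their malware harder to detect.
Existing tools for generating obfuscated code typically require access to the original source code (e.g., C++ or Java), and adding a new obfuscation to them is a non-trivial, labor-intensive process.
This study asks: can large language models (LLMs) generate new obfuscated assembly code?
If they can, this poses a risk to anti-virus engines and may give attackers more flexibility to create new obfuscation patterns.
The authors answer in the affirmative by developing the MetamorphASM benchmark, which comprises the MetamorphASM Dataset (MAD) and three code obfuscation techniques: dead code, register substitution, and control flow change.
MetamorphASM uses MAD, which contains 328,200 obfuscated assembly code samples, to systematically evaluate the ability of LLMs to generate and analyze obfuscated code.
The authors release the dataset and analyze the success rate of various LLMs (e.g., GPT-3.5/4, GPT-4o-mini, Starcoder, CodeGemma, CodeLlama, CodeT5, and LLaMA 3.1) in generating obfuscated assembly code.
The evaluation uses established information-theoretic metrics and manual human review to ensure correctness, and it provides a foundation for researchers to study and develop remediations for this risk.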
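
To make two of the named techniques concrete, here is a minimal Python sketch of dead code insertion and register substitution applied to a toy instruction list. It is an illustration under stated assumptions, not the paper's pipeline: the snippet, the helper names (`insert_dead_code`, `substitute_register`, `shannon_entropy`), and the choice of token-level Shannon entropy as a metric are all hypothetical, since the abstract does not say which information-theoretic metrics were used. Control flow change is not covered here.

```python
import math
import re
from collections import Counter

# Toy x86-style snippet; purely illustrative, not taken from MAD.
ORIGINAL = [
    "mov ecx, 5",
    "add ecx, 3",
    "mov eax, ecx",
    "ret",
]


def insert_dead_code(lines):
    """Insert a `nop` (an instruction with no effect) after every non-final line."""
    out = []
    for line in lines[:-1]:
        out.append(line)
        out.append("nop")
    out.append(lines[-1])
    return out


def substitute_register(lines, old, new):
    """Rename register `old` to `new`.

    Assumption: `new` is unused and not live in the snippet; real tools
    need a liveness analysis to guarantee this.
    """
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    return [pattern.sub(new, line) for line in lines]


def shannon_entropy(lines):
    """Shannon entropy in bits of the token distribution (split on commas/whitespace)."""
    tokens = [t for line in lines for t in re.split(r"[,\s]+", line) if t]
    counts = Counter(tokens)
    total = len(tokens)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


dead = insert_dead_code(ORIGINAL)
renamed = substitute_register(ORIGINAL, "ecx", "edx")
delta = shannon_entropy(dead) - shannon_entropy(ORIGINAL)
```

In this toy case, renaming a register leaves the token distribution unchanged, so its entropy matches the original, while inserting `nop` instructions shifts it. That suggests why a single entropy-style metric may not capture every obfuscation equally well, and why the authors pair such metrics with manual review.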

Abstract (original)

Malware authors often employ code obfuscations to make their malware harder to detect. Existing tools for generating obfuscated code often require access to the original source code (e.g., C++ or Java), and adding new obfuscations is a non-trivial, labor-intensive process. In this study, we ask the following question: Can Large Language Models (LLMs) potentially generate a new obfuscated assembly code? If so, this poses a risk to anti-virus engines and potentially increases the flexibility of attackers to create new obfuscation patterns. We answer this in the affirmative by developing the MetamorphASM benchmark comprising MetamorphASM Dataset (MAD) along with three code obfuscation techniques: dead code, register substitution, and control flow change. The MetamorphASM systematically evaluates the ability of LLMs to generate and analyze obfuscated code using MAD, which contains 328,200 obfuscated assembly code samples. We release this dataset and analyze the success rate of various LLMs (e.g., GPT-3.5/4, GPT-4o-mini, Starcoder, CodeGemma, CodeLlama, CodeT5, and LLaMA 3.1) in generating obfuscated assembly code. The evaluation was performed using established information-theoretic metrics and manual human review to ensure correctness and provide the foundation for researchers to study and develop remediations to this risk.

arXiv information

Authors	Seyedreza Mohseni, Seyedali Mohammadi, Deepa Tilwani, Yash Saxena, Gerald Ketu Ndawula, Sriram Vema, Edward Raff, Manas Gaur
Published	2025-01-29 13:52:31+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
