Source Code Foundation Models are Transferable Binary Analysis Knowledge Bases

Abstract

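Human-Oriented Binary Reverse Engineering (HOBRE) lies at the intersection of binary code and source code; it aims to lift binary code into human-readable content related to source code, thereby bridging the semantic gap between binary and source.
Recent advances in pre-training uni-modal code models, in particular generative Source Code Foundation Models (SCFMs) and binary understanding models, have laid the groundwork for transfer learning that can be applied to HOBRE.
However, existing HOBRE approaches rely heavily on uni-modal models, such as SCFMs for supervised fine-tuning or general-purpose LLMs for prompting, and therefore perform sub-optimally.
Inspired by recent progress in large multi-modal models, the authors propose that the strengths of uni-modal code models on both sides can be combined to bridge the semantic gap effectively.
The paper introduces a novel probe-and-recover framework that incorporates a binary-source encoder-decoder model and black-box LLMs for binary analysis.
The approach uses the pre-trained knowledge inside SCFMs to synthesize relevant, symbol-rich code fragments as context.
This additional context lets the black-box LLMs recover information more accurately.
The authors report significant improvements in zero-shot binary summarization and binary function name recovery: a 10.3% relative gain in CHRF and a 16.7% relative gain in a GPT4-based metric for summarization, and absolute increases of 6.7% and 7.4% in token-level precision and recall, respectively, for name recovery.
These results highlight the effectiveness of the approach in automating and improving binary code analysis.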

Abstract (original)

Human-Oriented Binary Reverse Engineering (HOBRE) lies at the intersection of binary and source code, aiming to lift binary code to human-readable content relevant to source code, thereby bridging the binary-source semantic gap. Recent advancements in uni-modal code model pre-training, particularly in generative Source Code Foundation Models (SCFMs) and binary understanding models, have laid the groundwork for transfer learning applicable to HOBRE. However, existing approaches for HOBRE rely heavily on uni-modal models like SCFMs for supervised fine-tuning or general LLMs for prompting, resulting in sub-optimal performance. Inspired by recent progress in large multi-modal models, we propose that it is possible to harness the strengths of uni-modal code models from both sides to bridge the semantic gap effectively. In this paper, we introduce a novel probe-and-recover framework that incorporates a binary-source encoder-decoder model and black-box LLMs for binary analysis. Our approach leverages the pre-trained knowledge within SCFMs to synthesize relevant, symbol-rich code fragments as context. This additional context enables black-box LLMs to enhance recovery accuracy. We demonstrate significant improvements in zero-shot binary summarization and binary function name recovery, with a 10.3% relative gain in CHRF and a 16.7% relative gain in a GPT4-based metric for summarization, as well as a 6.7% and 7.4% absolute increase in token-level precision and recall for name recovery, respectively. These results highlight the effectiveness of our approach in automating and improving binary code analysis.
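The abstract reports name-recovery results as token-level precision and recall but does not define them here. The following is a minimal sketch of one common way to compute such scores, assuming function names are split into lowercase tokens on underscores and camelCase boundaries; the helper names `split_name` and `token_precision_recall` and the tokenization rule are illustrative assumptions, not the paper's evaluation code.

```python
import re
from collections import Counter


def split_name(name: str) -> list[str]:
    """Split a function name into lowercase tokens.

    Assumption: tokens are separated by underscores and camelCase
    boundaries; the paper's actual tokenizer may differ.
    """
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def token_precision_recall(predicted: str, reference: str) -> tuple[float, float]:
    """Return (precision, recall) over the multisets of name tokens."""
    pred = Counter(split_name(predicted))
    ref = Counter(split_name(reference))
    # Multiset intersection counts each shared token at most as often
    # as it appears in both names.
    overlap = sum((pred & ref).values())
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


if __name__ == "__main__":
    # One of two predicted tokens is correct; one of three reference
    # tokens is recovered.
    print(token_precision_recall("parse_config", "read_config_file"))
```

Under this definition, a predicted name is rewarded for each token it shares with the ground-truth name, so a partially correct name still earns partial credit.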

arXiv information

Authors	Zian Su, Xiangzhe Xu, Ziyang Huang, Kaiyuan Zhang, Xiangyu Zhang
Published	2024-10-30 16:12:36+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
