HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data

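Abstract

Large language models (LLMs) have shown great potential for automatic code generation and form the basis of tools such as GitHub Copilot.
However, recent studies show that much LLM-generated code contains serious security vulnerabilities.
Previous work has tried to address this by training models to generate secure code, but these attempts remain constrained by limited access to training data and labor-intensive data preparation.
This paper introduces HexaCoder, a novel approach that strengthens the ability of LLMs to generate secure code by automatically synthesizing secure code, which reduces the effort of finding suitable training data.
HexaCoder comprises two key components: an oracle-guided data synthesis pipeline and a two-step process for secure code generation.
The data synthesis pipeline generates pairs of vulnerable and fixed code for specific Common Weakness Enumeration (CWE) types by using a state-of-the-art LLM to repair vulnerable code.
A security oracle identifies the vulnerabilities, and the state-of-the-art LLM repairs them by extending and/or editing the code, producing data pairs for fine-tuning with the Low-Rank Adaptation (LoRA) method.
Each example in the fine-tuning dataset includes the necessary security-related libraries and code, which form the basis of the novel two-step generation approach.
This lets the model integrate security-relevant libraries before generating the main code, reducing the number of vulnerable code samples generated by up to 85% compared with the baseline methods.
The authors perform extensive evaluations on three different benchmarks for four LLMs, demonstrating that HexaCoder not only improves the security of the generated code but also maintains a high level of functional correctness.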
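
The abstract gives no implementation details, so the following is only a minimal Python sketch of how an oracle-guided synthesis loop and a two-step generation call might fit together. Everything in it is an assumption made for illustration: `toy_oracle` and `toy_repair` are hypothetical stand-ins for the security oracle and the repair LLM, the `TrainingPair` fields, the `max_rounds` retry limit, and the re-check of each fix by the oracle are not taken from the paper, and the CWE-79 example was chosen only because its fix needs an extra library import.

```python
import html  # used only by the final demo call
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Report:
    cwe: str   # CWE identifier reported by the oracle
    hint: str  # short description passed on to the repair step


@dataclass
class TrainingPair:
    cwe: str
    vulnerable: str
    fixed_imports: str  # security-related libraries of the fixed code
    fixed_body: str     # remaining (main) code of the fixed version


def toy_oracle(code: str) -> Optional[Report]:
    """Hypothetical oracle: flags HTML built without html.escape."""
    if "<p>" in code and "html.escape(" not in code:
        return Report("CWE-79", "escape user input before embedding it in HTML")
    return None


def toy_repair(code: str, report: Report) -> str:
    """Hypothetical repair model: adds a library and edits the code."""
    patched = code.replace("+ name +", "+ html.escape(name) +")
    return "import html\n\n" + patched


def split_imports(code: str) -> Tuple[str, str]:
    """Separate top-level import lines from the rest of the code."""
    lines = code.splitlines()
    is_import = lambda line: line.startswith(("import ", "from "))
    imports = [line for line in lines if is_import(line)]
    body = [line for line in lines if not is_import(line)]
    return "\n".join(imports), "\n".join(body).strip()


def synthesize(
    samples: List[str],
    oracle: Callable[[str], Optional[Report]],
    repair: Callable[[str, Report], str],
    max_rounds: int = 3,  # assumed retry limit, not from the paper
) -> List[TrainingPair]:
    """Keep a (vulnerable, fixed) pair only if the oracle accepts the fix."""
    pairs = []
    for code in samples:
        report = oracle(code)
        if report is None:
            continue  # the oracle found nothing to repair in this sample
        fixed, current = code, report
        for _ in range(max_rounds):
            fixed = repair(fixed, current)
            current = oracle(fixed)
            if current is None:
                imports, body = split_imports(fixed)
                pairs.append(TrainingPair(report.cwe, code, imports, body))
                break
    return pairs


def two_step_generate(
    prompt: str,
    gen_imports: Callable[[str], str],
    gen_body: Callable[[str], str],
) -> str:
    """Step 1: produce the libraries. Step 2: produce the main code given them."""
    imports = gen_imports(prompt)
    body = gen_body(prompt + "\n" + imports)
    return imports + "\n\n" + body


VULNERABLE = 'def greet(name):\n    return "<p>Hello " + name + "</p>"\n'
SAFE = "def add(a, b):\n    return a + b\n"

pairs = synthesize([VULNERABLE, SAFE], toy_oracle, toy_repair)

# Replay the stored pair through the two-step interface; a fine-tuned model
# would take the place of these two lambdas.
generated = two_step_generate(
    "greet a user in HTML",
    lambda prompt: pairs[0].fixed_imports,
    lambda prompt: pairs[0].fixed_body,
)
```

Storing the imports separately from the body mirrors the abstract's statement that each fine-tuning example carries both the security-related libraries and the code. In a real setting the oracle would presumably be a program-analysis tool and both generation callables would be calls to the LoRA-fine-tuned model.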

Abstract (original)

Large language models (LLMs) have shown great potential for automatic code generation and form the basis for various tools such as GitHub Copilot. However, recent studies highlight that many LLM-generated code contains serious security vulnerabilities. While previous work tries to address this by training models that generate secure code, these attempts remain constrained by limited access to training data and labor-intensive data preparation. In this paper, we introduce HexaCoder, a novel approach to enhance the ability of LLMs to generate secure codes by automatically synthesizing secure codes, which reduces the effort of finding suitable training data. HexaCoder comprises two key components: an oracle-guided data synthesis pipeline and a two-step process for secure code generation. The data synthesis pipeline generates pairs of vulnerable and fixed codes for specific Common Weakness Enumeration (CWE) types by utilizing a state-of-the-art LLM for repairing vulnerable code. A security oracle identifies vulnerabilities, and a state-of-the-art LLM repairs them by extending and/or editing the codes, creating data pairs for fine-tuning using the Low-Rank Adaptation (LoRA) method. Each example of our fine-tuning dataset includes the necessary security-related libraries and code that form the basis of our novel two-step generation approach. This allows the model to integrate security-relevant libraries before generating the main code, significantly reducing the number of generated vulnerable codes by up to 85% compared to the baseline methods. We perform extensive evaluations on three different benchmarks for four LLMs, demonstrating that HexaCoder not only improves the security of the generated code but also maintains a high level of functional correctness.

arXiv information

Authors	Hossein Hajipour,Lea Schönherr,Thorsten Holz,Mario Fritz
Published	2024-09-10 12:01:43+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
