Correctness Assessment of Code Generated by Large Language Models Using Internal Representations

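Summary

Ensuring the correctness of code generated by large language models (LLMs) is a major challenge in AI-driven software development.
Existing approaches mostly rely on black-box (closed-box) techniques that evaluate correctness after generation, and so cannot use the rich information embedded in the LLM's internal states during code generation.
This paper introduces OPENIA, a new white-box (open-box) framework that uses these internal representations to assess the correctness of LLM-generated code.
OPENIA systematically analyzes the intermediate states of representative open-source LLMs specialized for code, including DeepSeek-Coder, CodeLlama, and MagicCoder, across a range of code generation benchmarks.
The empirical analysis shows that these internal representations encode latent information that correlates strongly with the correctness of the generated code.
Building on these findings, OPENIA takes a white-box/open-box approach to predicting code correctness, offering significant advantages in adaptability and robustness over traditional classification-based methods and zero-shot approaches.
Experimental results show that OPENIA consistently outperforms the baseline models, achieving higher accuracy, precision, recall, and F1-score, with up to a 2x improvement in standalone code generation and a 46% improvement in repository-specific scenarios.
By unlocking the potential of in-process signals, OPENIA paves the way for more proactive and efficient quality assurance in LLM-assisted code generation.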
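
The general idea of predicting correctness from internal representations can be illustrated with a small probing classifier. The sketch below is not OPENIA's implementation: the abstract does not state the probe architecture, the layers used, or how the states are extracted. It assumes each generated program is already summarized as one fixed-length vector, replaces real hidden states with synthetic Gaussian vectors, and trains a plain logistic-regression probe. All names (`make_synthetic_states`, `train_probe`, and so on) are hypothetical.

```python
import math
import random


def make_synthetic_states(n, dim, seed=0):
    """Hypothetical stand-in for LLM hidden states (an assumption, not real data).

    Vectors for "correct" and "incorrect" generations are drawn from two
    Gaussians whose means differ, so a linear probe has signal to find.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)  # 1 = generated code is correct
        shift = 0.5 if label == 1 else -0.5
        vec = [rng.gauss(shift, 1.0) for _ in range(dim)]
        data.append((vec, label))
    return data


def sigmoid(z):
    # Numerically safe logistic function for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, vec):
    # Probability that the generation behind `vec` is correct.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, vec)) + b)


def train_probe(data, dim, epochs=50, lr=0.1):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for vec, label in data:
            grad = score(w, b, vec) - label  # d(log-loss)/d(logit)
            for i in range(dim):
                w[i] -= lr * grad * vec[i]
            b -= lr * grad
    return w, b


def accuracy(w, b, data):
    hits = sum((score(w, b, vec) >= 0.5) == (label == 1) for vec, label in data)
    return hits / len(data)


DIM = 16
train_set = make_synthetic_states(400, DIM, seed=0)
test_set = make_synthetic_states(200, DIM, seed=1)
w, b = train_probe(train_set, DIM)
probe_accuracy = accuracy(w, b, test_set)
print(f"held-out probe accuracy: {probe_accuracy:.3f}")
```

In a real setting the synthetic vectors would be replaced by states captured from the code LLM during generation, paired with labels obtained by running each generated program against its tests.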

Summary (original)

Ensuring the correctness of code generated by Large Language Models (LLMs) presents a significant challenge in AI-driven software development. Existing approaches predominantly rely on black-box (closed-box) approaches that evaluate correctness post-generation, failing to utilize the rich insights embedded in the LLMs’ internal states during code generation. In this paper, we introduce OPENIA, a novel white-box (open-box) framework that leverages these internal representations to assess the correctness of LLM-generated code. OPENIA systematically analyzes the intermediate states of representative open-source LLMs specialized for code, including DeepSeek-Coder, CodeLlama, and MagicCoder, across diverse code generation benchmarks. Our empirical analysis reveals that these internal representations encode latent information, which strongly correlates with the correctness of the generated code. Building on these insights, OPENIA uses a white-box/open-box approach to make informed predictions about code correctness, offering significant advantages in adaptability and robustness over traditional classification-based methods and zero-shot approaches. Experimental results demonstrate that OPENIA consistently outperforms baseline models, achieving higher accuracy, precision, recall, and F1-Scores with up to a 2X improvement in standalone code generation and a 46% enhancement in repository-specific scenarios. By unlocking the potential of in-process signals, OPENIA paves the way for more proactive and efficient quality assurance mechanisms in LLM-assisted code generation.
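
For reference, the four metrics named in the abstract can be computed from binary correctness predictions as sketched below. The labels and predictions are made-up illustration values, not results from the paper, and `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = correct code)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up example: 8 generated programs, ground truth vs. predicted correctness.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```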

arXiv information

Authors	Tuan-Dung Bui, Thanh Trong Vu, Thu-Trang Nguyen, Son Nguyen, Hieu Dinh Vo
Published	2025-01-22 15:04:13+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
