Can You Really Trust Code Copilots? Evaluating Large Language Models from a Code Security Perspective

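Abstract

Code security and usability are both essential for the various coding assistant applications driven by large language models (LLMs).
Current code security benchmarks focus on a single evaluation task and paradigm, such as code completion or generation, and lack a comprehensive assessment across dimensions such as secure code generation, vulnerability repair, and vulnerability discrimination.
In this paper, we first propose CoV-Eval, a multi-task benchmark for comprehensive evaluation of LLM code security that covers code completion, vulnerability repair, vulnerability detection, and vulnerability classification.
In addition, we developed VC-Judge, an improved judgment model that aligns closely with human experts and can review LLM-generated programs for vulnerabilities more efficiently and reliably.
We conduct a comprehensive evaluation of 20 proprietary and open-source LLMs.
Overall, most LLMs identify vulnerable code well, but they still tend to generate insecure code and struggle to recognize specific vulnerability types and to perform repairs.
Extensive experiments and qualitative analyses reveal key challenges and optimization directions, offering insights for future research in LLM code security.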

Abstract (original)

Code security and usability are both essential for various coding assistant applications driven by large language models (LLMs). Current code security benchmarks focus solely on single evaluation task and paradigm, such as code completion and generation, lacking comprehensive assessment across dimensions like secure code generation, vulnerability repair and discrimination. In this paper, we first propose CoV-Eval, a multi-task benchmark covering various tasks such as code completion, vulnerability repair, vulnerability detection and classification, for comprehensive evaluation of LLM code security. Besides, we developed VC-Judge, an improved judgment model that aligns closely with human experts and can review LLM-generated programs for vulnerabilities in a more efficient and reliable way. We conduct a comprehensive evaluation of 20 proprietary and open-source LLMs. Overall, while most LLMs identify vulnerable codes well, they still tend to generate insecure codes and struggle with recognizing specific vulnerability types and performing repairs. Extensive experiments and qualitative analyses reveal key challenges and optimization directions, offering insights for future research in LLM code security.

arXiv information

Authors	Yutao Mou, Xiao Deng, Yuxiao Luo, Shikun Zhang, Wei Ye
Published	2025-05-15 16:53:41+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
