Evaluating Large Language Models for Code Review

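Summary

Context: Code reviews are crucial for software quality.
Recent advances in AI have enabled large language models (LLMs) to review and fix code.
Tools that perform these reviews now exist.
However, their reliability and accuracy have not yet been systematically evaluated.
Objective: This study compares the performance of different LLMs in detecting code correctness and suggesting improvements.
Method: GPT4o and Gemini 2.0 Flash were tested on 492 AI-generated code blocks of varying correctness, along with 164 canonical code blocks from the HumanEval benchmark.
To simulate the code review task objectively, the LLMs were expected to assess code correctness and improve the code if needed.
Experiments were run with different configurations, and the results are reported.
Results: With problem descriptions, GPT4o and Gemini 2.0 Flash correctly classified code correctness 68.50% and 63.89% of the time, respectively, and corrected the code 67.83% and 54.26% of the time for the 492 code blocks of varying correctness.
Without problem descriptions, performance declined.
The results for the 164 canonical code blocks differed, suggesting that performance depends on the type of code.
Conclusion: LLM code reviews can help suggest improvements and assess correctness, but there is a risk of faulty outputs.
The authors propose a process that involves humans, called the "Human in the loop LLM Code Review", to promote knowledge sharing while mitigating the risk of faulty outputs.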

Abstract (original)

Context: Code reviews are crucial for software quality. Recent AI advances have allowed large language models (LLMs) to review and fix code; now, there are tools that perform these reviews. However, their reliability and accuracy have not yet been systematically evaluated. Objective: This study compares different LLMs’ performance in detecting code correctness and suggesting improvements. Method: We tested GPT4o and Gemini 2.0 Flash on 492 AI generated code blocks of varying correctness, along with 164 canonical code blocks from the HumanEval benchmark. To simulate the code review task objectively, we expected LLMs to assess code correctness and improve the code if needed. We ran experiments with different configurations and reported on the results. Results: With problem descriptions, GPT4o and Gemini 2.0 Flash correctly classified code correctness 68.50% and 63.89% of the time, respectively, and corrected the code 67.83% and 54.26% of the time for the 492 code blocks of varying correctness. Without problem descriptions, performance declined. The results for the 164 canonical code blocks differed, suggesting that performance depends on the type of code. Conclusion: LLM code reviews can help suggest improvements and assess correctness, but there is a risk of faulty outputs. We propose a process that involves humans, called the ‘Human in the loop LLM Code Review’ to promote knowledge sharing while mitigating the risk of faulty outputs.
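
The two headline metrics in the results can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the `ReviewRecord` schema, the function names, and the sample data are all hypothetical, and the sketch assumes the correction rate is the share of code blocks whose post-review code passes its tests, which the abstract does not define precisely.

```python
from dataclasses import dataclass


@dataclass
class ReviewRecord:
    """One reviewed code block (hypothetical schema, not from the paper)."""
    is_correct: bool        # ground truth: the original code passes its tests
    llm_says_correct: bool  # the LLM's verdict on the original code
    revised_passes: bool    # the code after the LLM's review passes its tests


def classification_accuracy(records):
    """Share of blocks where the LLM's verdict matches the ground truth."""
    if not records:
        raise ValueError("no records to evaluate")
    hits = sum(r.is_correct == r.llm_says_correct for r in records)
    return hits / len(records)


def correction_rate(records):
    """Share of blocks whose post-review code passes (assumed definition)."""
    if not records:
        raise ValueError("no records to evaluate")
    return sum(r.revised_passes for r in records) / len(records)


# Made-up sample data, for illustration only.
sample = [
    ReviewRecord(is_correct=True, llm_says_correct=True, revised_passes=True),
    ReviewRecord(is_correct=False, llm_says_correct=False, revised_passes=True),
    ReviewRecord(is_correct=False, llm_says_correct=False, revised_passes=False),
    ReviewRecord(is_correct=True, llm_says_correct=False, revised_passes=False),
]

print(f"classification accuracy: {classification_accuracy(sample):.2%}")
print(f"correction rate: {correction_rate(sample):.2%}")
```

Keeping the two metrics separate reflects the abstract's distinction between judging correctness and actually fixing the code: a model can flag a faulty block correctly and still fail to repair it, as the third sample record shows.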

arXiv information

Authors	Umut Cihan, Arda İçöz, Vahid Haratian, Eray Tüzün
Published	2025-05-26 16:47:29+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
