Prompting and Fine-tuning Large Language Models for Automated Code Review Comment Generation

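Summary

Generating accurate code review comments remains a significant challenge because the task's output is inherently diverse and non-unique.
Large language models pretrained on both programming and natural language data tend to perform well on code-oriented tasks.
However, large-scale pretraining is not always feasible, owing to its environmental impact and to project-specific generalizability issues.
In this work, we first fine-tune open-source large language models (LLMs) on consumer-grade hardware in a parameter-efficient, quantized low-rank (QLoRA) fashion to improve review comment generation.
Recent studies have demonstrated that augmenting prompts with semantic metadata improves performance on other code-related tasks.
To explore this for code review, we also prompt proprietary, closed-source LLMs, augmenting the input code patch with function call graphs and code summaries.
Both strategies improve review comment generation: few-shot prompting augmented with function call graphs on the GPT-3.5 model surpasses the pretrained baseline by around 90% in BLEU-4 score on the CodeReviewer dataset.
In addition, few-shot prompted Gemini-1.0 Pro and QLoRA fine-tuned Code Llama and Llama 3.1 models achieve competitive results on this task (performance improvements ranging from 25% to 83%).
An additional human evaluation study further validates the experimental findings, reflecting real-world developers' perceptions of LLM-generated code review comments on relevant qualitative metrics.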
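The percentage figures above read most naturally as gains relative to the pretrained baseline's BLEU-4 score, although the abstract does not spell out the formula. Under that assumption, a minimal sketch of the arithmetic looks like this; the function name and the example scores are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline: float, candidate: float) -> float:
    """Return the gain of `candidate` over `baseline` as a percentage of `baseline`.

    Assumption: the reported percentages are relative gains of this form.
    """
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (candidate - baseline) / baseline * 100.0


# Made-up scores for illustration only (not results from the paper):
# a baseline BLEU-4 of 5.0 and a candidate BLEU-4 of 9.5.
gain = relative_improvement(5.0, 9.5)
print(f"relative improvement: {gain:.1f}%")
```

Read this way, a candidate that nearly doubles the baseline score corresponds to a gain of roughly 90%.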

Abstract (original)

Generating accurate code review comments remains a significant challenge due to the inherently diverse and non-unique nature of the task output. Large language models pretrained on both programming and natural language data tend to perform well in code-oriented tasks. However, large-scale pretraining is not always feasible due to its environmental impact and project-specific generalizability issues. In this work, first we fine-tune open-source Large language models (LLM) in parameter-efficient, quantized low-rank (QLoRA) fashion on consumer-grade hardware to improve review comment generation. Recent studies demonstrate the efficacy of augmenting semantic metadata information into prompts to boost performance in other code-related tasks. To explore this in code review activities, we also prompt proprietary, closed-source LLMs augmenting the input code patch with function call graphs and code summaries. Both of our strategies improve the review comment generation performance, with function call graph augmented few-shot prompting on the GPT-3.5 model surpassing the pretrained baseline by around 90% BLEU-4 score on the CodeReviewer dataset. Moreover, few-shot prompted Gemini-1.0 Pro, QLoRA fine-tuned Code Llama and Llama 3.1 models achieve competitive results (ranging from 25% to 83% performance improvement) on this task. An additional human evaluation study further validates our experimental findings, reflecting real-world developers’ perceptions of LLM-generated code review comments based on relevant qualitative metrics.
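The abstract describes few-shot prompting in which each input code patch is augmented with a function call graph and a code summary. As a rough illustration of what assembling such a prompt might look like, here is a minimal sketch. The field labels, the `ReviewExample` structure, and the helper names are all hypothetical; the abstract does not specify the prompt template, and no model API is called here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReviewExample:
    """One few-shot demonstration (hypothetical structure)."""
    patch: str
    call_graph: str
    summary: str
    review: str


def format_query(patch: str, call_graph: str, summary: str) -> str:
    """Render a patch plus its metadata, ending where the review should go."""
    return (
        f"Code patch:\n{patch}\n"
        f"Call graph:\n{call_graph}\n"
        f"Summary:\n{summary}\n"
        "Review comment:"
    )


def build_prompt(examples: List[ReviewExample], patch: str,
                 call_graph: str, summary: str) -> str:
    """Concatenate solved demonstrations, then the unsolved query."""
    blocks = [
        format_query(e.patch, e.call_graph, e.summary) + " " + e.review
        for e in examples
    ]
    # The final block has no review, so the model is asked to complete it.
    blocks.append(format_query(patch, call_graph, summary))
    return "\n\n".join(blocks)


demo = ReviewExample(
    patch="- return a+b\n+ return a + b",
    call_graph="main -> add",
    summary="Adds two numbers.",
    review="Formatting-only change; looks fine.",
)
prompt = build_prompt([demo], "+ x = load(path)", "run -> load", "Loads a file.")
print(prompt)
```

The resulting string would then be sent to whichever LLM is being evaluated; how the call graphs and summaries are produced is outside the scope of this sketch.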

arXiv information

Authors	Md. Asif Haider, Ayesha Binte Mostofa, Sk. Sabit Bin Mosaddek, Anindya Iqbal, Toufique Ahmed
Published	2024-11-15 12:01:38+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
