DevEval: Evaluating Code Generation in Practical Software Projects

Abstract

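How to evaluate large language models (LLMs) on code generation remains an open question.
Many benchmarks have been proposed, but they are inconsistent with practical software projects: their program distributions are unrealistic, their dependencies are insufficient, and their project contexts are small in scale.
As a result, the capabilities of LLMs in practical projects remain unclear.
In this paper, we propose a new benchmark named DevEval, aligned with developers' experiences in practical projects.
DevEval is collected through a rigorous pipeline and contains 2,690 samples from 119 practical projects, covering 10 domains.
Compared with previous benchmarks, DevEval aligns with practical projects along multiple dimensions, including real program distributions, sufficient dependencies, and sufficiently large project contexts.
We assess five popular LLMs on DevEval (e.g., gpt-4, gpt-3.5-turbo, CodeLLaMa, and StarCoder) and reveal their actual abilities in code generation.
For instance, the highest Pass@1 achieved by gpt-3.5-turbo in our experiments is only 42.
We also discuss the challenges and future directions of code generation in practical projects.
We open-source DevEval and hope it can facilitate the development of code generation in practical projects.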
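
The abstract reports Pass@1 but does not say how it is computed. As a minimal sketch, assuming the widely used unbiased pass@k estimator (for a task with n generated samples of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k)), the metric could be calculated as below. The function names and the per-task counts are hypothetical and are not taken from DevEval.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task (assumed estimator, not confirmed by the source).

    n: number of samples generated for the task
    c: number of those samples that pass all tests
    k: number of attempts the metric allows
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, every size-k subset contains a passing sample.
    if n - c < k:
        return 1.0
    # 1 minus the probability that k samples drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; results holds hypothetical (n, c) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Made-up counts for three tasks, 10 samples each.
    demo = [(10, 3), (10, 10), (10, 0)]
    print(f"Pass@1 = {benchmark_pass_at_k(demo, k=1):.3f}")
```

With k = 1 the estimator reduces to c / n per task, so Pass@1 is the average fraction of generated samples that pass their tests.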

Abstract (original)

How to evaluate Large Language Models (LLMs) in code generation is an open question. Many benchmarks have been proposed but are inconsistent with practical software projects, e.g., unreal program distributions, insufficient dependencies, and small-scale project contexts. Thus, the capabilities of LLMs in practical projects are still unclear. In this paper, we propose a new benchmark named DevEval, aligned with Developers’ experiences in practical projects. DevEval is collected through a rigorous pipeline, containing 2,690 samples from 119 practical projects and covering 10 domains. Compared to previous benchmarks, DevEval aligns to practical projects in multiple dimensions, e.g., real program distributions, sufficient dependencies, and enough-scale project contexts. We assess five popular LLMs on DevEval (e.g., gpt-4, gpt-3.5-turbo, CodeLLaMa, and StarCoder) and reveal their actual abilities in code generation. For instance, the highest Pass@1 of gpt-3.5-turbo only is 42 in our experiments. We also discuss the challenges and future directions of code generation in practical projects. We open-source DevEval and hope it can facilitate the development of code generation in practical projects.

arXiv information

Authors	Jia Li, Ge Li, Yunfei Zhao, Yongmin Li, Zhi Jin, Hao Zhu, Huanyu Liu, Kaibo Liu, Lecheng Wang, Zheng Fang, Lanshen Wang, Jiazheng Ding, Xuanming Zhang, Yihong Dong, Yuqi Zhu, Bin Gu, Mengfei Yang
Published	2024-01-26 02:36:27+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
