DevBench: A Comprehensive Benchmark for Software Development

Abstract

Recent advancements in large language models (LLMs) have significantly enhanced their coding capabilities. However, existing benchmarks predominantly focus on simplified or isolated aspects of programming, such as single-file code generation or repository issue debugging, falling short of measuring the full spectrum of challenges raised by real-world programming activities. To this end, we propose DevBench, a comprehensive benchmark that evaluates LLMs across various stages of the software development lifecycle, including software design, environment setup, implementation, acceptance testing, and unit testing. DevBench features a wide range of programming languages and domains, high-quality data collection, and carefully designed and verified metrics for each task. Empirical studies show that current LLMs, including GPT-4-Turbo, fail to solve the challenges presented within DevBench. Analyses reveal that models struggle with understanding the complex structures in the repository, managing the compilation process, and grasping advanced programming concepts. Our findings offer actionable insights for the future development of LLMs toward real-world programming applications. Our benchmark is available at https://github.com/open-compass/DevBench

arXiv information

Authors	Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, Zhiyin Yu, He Du, Ping Yang, Dahua Lin, Chao Peng, Kai Chen
Published	2024-03-15 13:23:34+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
