NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts

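Summary

Large language models (LLMs) are highly capable of generating code for productive work.
However, current code synthesis benchmarks such as HumanEval, MBPP, and DS-1000 focus mainly on introductory algorithm and data science tasks and do not adequately capture the demanding requirements common in real-world coding.
To fill this gap, we propose NaturalCodeBench (NCB), a challenging code benchmark designed to reflect the complexity and variety of scenarios in real coding tasks.
NCB comprises 402 high-quality Python and Java problems, carefully selected from natural user queries submitted to online coding services and covering 6 different domains.
Because writing test cases for real-world queries is extremely difficult, we also introduce a semi-automated pipeline that makes test case construction more efficient.
Compared with manual construction, it improves efficiency by more than 4 times.
Systematic experiments on 39 LLMs show that models with similar HumanEval scores can still differ substantially on NCB, which points to a lack of focus on practical code synthesis scenarios or to over-optimization for HumanEval.
Moreover, even the best-performing model, GPT-4, is still far from satisfactory on NCB.
The evaluation toolkit and development set are available at https://github.com/THUDM/NaturalCodeBench.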

Summary (original)

Large language models (LLMs) have manifested strong ability to generate codes for productive activities. However, current benchmarks for code synthesis, such as HumanEval, MBPP, and DS-1000, are predominantly oriented towards introductory tasks on algorithm and data science, insufficiently satisfying challenging requirements prevalent in real-world coding. To fill this gap, we propose NaturalCodeBench (NCB), a challenging code benchmark designed to mirror the complexity and variety of scenarios in real coding tasks. NCB comprises 402 high-quality problems in Python and Java, meticulously selected from natural user queries from online coding services, covering 6 different domains. Noting the extraordinary difficulty in creating testing cases for real-world queries, we also introduce a semi-automated pipeline to enhance the efficiency of test case construction. Comparing with manual solutions, it achieves an efficiency increase of more than 4 times. Our systematic experiments on 39 LLMs find that performance gaps on NCB between models with close HumanEval scores could still be significant, indicating a lack of focus on practical code synthesis scenarios or over-specified optimization on HumanEval. On the other hand, even the best-performing GPT-4 is still far from satisfying on NCB. The evaluation toolkit and development set are available at https://github.com/THUDM/NaturalCodeBench.

arXiv information

Authors	Shudan Zhang, Hanlin Zhao, Xiao Liu, Qinkai Zheng, Zehan Qi, Xiaotao Gu, Xiaohan Zhang, Yuxiao Dong, Jie Tang
Published	2024-05-07 17:52:51+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
