NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional Correctness

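Summary

Existing evaluation benchmarks for code language models (code LMs) focus almost exclusively on whether the LMs can generate functionally correct code. In real-world software engineering, developers think beyond functional correctness. They have requirements on "how" a functionality should be implemented so that it meets overall system design objectives such as efficiency, security, and maintainability. Developers would also trust LMs more if the LMs showed a solid understanding of requirements and code semantics. We propose NoFunEval, a new benchmark that evaluates code LMs on non-functional requirements and on simple classification instances covering both functional and non-functional requirements. We also propose a prompting method, Coding Concepts (CoCo), as a way for developers to communicate domain knowledge to the LMs. We conducted an extensive evaluation of 22 code LMs. We found that these LMs generally falter when tested on our benchmark, which hints at fundamental blind spots in their training setups. Surprisingly, even their classification accuracy on functional-correctness instances derived from the popular HumanEval benchmark is low, which calls into question the depth of their comprehension and the source of their success in generating functionally correct code in the first place. We plan to release the benchmark and evaluation scripts publicly at https://aka.ms/NoFunEval.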

Abstract (original)

Existing evaluation benchmarks of language models of code (code LMs) focus almost exclusively on whether the LMs can generate functionally-correct code. In real-world software engineering, developers think beyond functional correctness. They have requirements on ‘how’ a functionality should be implemented to meet overall system design objectives like efficiency, security, and maintainability. They would also trust the code LMs more if the LMs demonstrate robust understanding of requirements and code semantics. We propose a new benchmark NoFunEval to evaluate code LMs on non-functional requirements and simple classification instances for both functional and non-functional requirements. We propose a prompting method, Coding Concepts (CoCo), as a way for a developer to communicate the domain knowledge to the LMs. We conduct an extensive evaluation of twenty-two code LMs. Our finding is that they generally falter when tested on our benchmark, hinting at fundamental blindspots in their training setups. Surprisingly, even the classification accuracy on functional-correctness instances derived from the popular HumanEval benchmark is low, calling in question the depth of their comprehension and the source of their success in generating functionally-correct code in the first place. We will release our benchmark and evaluation scripts publicly at https://aka.ms/NoFunEval.

arXiv information

Authors	Manav Singhal, Tushar Aggarwal, Abhijeet Awasthi, Nagarajan Natarajan, Aditya Kanade
Published	2024-02-02 18:11:27+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
