Execution-Based Evaluation for Open-Domain Code Generation

Summary

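To extend the scope of coding queries to more realistic settings, we propose ODEX, the first open-domain, execution-based natural language (NL) to Python code generation dataset.
ODEX contains 945 NL-code pairs spanning 79 diverse libraries, along with 1,707 human-written test cases for execution.
The NL-code pairs are collected from StackOverflow forums to encourage natural and practical coding queries.
In addition, ODEX supports intents in four natural languages: English, Spanish, Japanese, and Russian.
ODEX reveals interesting behavioral differences among top-performing code language models (LMs).
While CODEX achieves better overall results, CODEGEN improves effectively with scaling: CODEGEN 6.1B performs comparably to CODEX 12B.
Both models show substantial gaps between open and closed domains, but the CODEGEN gap tends to shrink as model size grows, whereas the CODEX gap widens.
We release ODEX to facilitate research on open-domain problems in the code generation community.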

Summary (original)

To extend the scope of coding queries to more realistic settings, we propose ODEX, the first Open-Domain EXecution-based natural language (NL) to Python code generation dataset. ODEX has 945 NL-Code pairs spanning 79 diverse libraries, along with 1,707 human-written test cases for execution. Our NL-Code pairs are harvested from StackOverflow forums to encourage natural and practical coding queries. Moreover, ODEX supports four natural languages as intents, in English, Spanish, Japanese, and Russian. ODEX unveils intriguing behavioral differences among top-performing code language models (LM). While CODEX achieves better overall results, CODEGEN improves effectively via scaling — CODEGEN 6.1B performs comparably with CODEX 12B. Both models show substantial gaps between open and closed domains, but CODEGEN gaps tend to decrease with model size while CODEX gaps increase. We release ODEX to facilitate research into open-domain problems for the code generation community.
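The abstract centers on execution-based evaluation: a generated snippet is judged by running it against human-written test cases rather than by comparing its text to a reference solution. Below is a minimal Python sketch of that idea, assuming each sample is a pair of strings (candidate code, test code) and that a sample passes when the combined program exits without error. `passes_tests` and `execution_accuracy` are hypothetical helpers written for illustration; they are not the ODEX evaluation harness or its API, and running each program in a separate interpreter with a timeout is not a security sandbox.

```python
import subprocess
import sys
import textwrap


def passes_tests(candidate_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Run a generated snippet followed by its tests in a fresh Python process.

    Hypothetical helper: returns True only if the combined program exits with
    status 0 before the timeout. This isolates crashes and infinite loops from
    the caller, but it is NOT a sandbox for untrusted model output.
    """
    program = candidate_code + "\n\n" + test_code + "\n"
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,  # keep the child's stdout/stderr out of our output
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # A hanging candidate counts as a failure.
        return False
    # A failed assert (or any uncaught exception) gives a non-zero exit status.
    return result.returncode == 0


def execution_accuracy(samples) -> float:
    """Fraction of (candidate_code, test_code) pairs whose tests all pass."""
    if not samples:
        return 0.0
    passed = sum(passes_tests(code, tests) for code, tests in samples)
    return passed / len(samples)


if __name__ == "__main__":
    # Hypothetical query in the style of a StackOverflow question:
    # "Sort a list of dictionaries by the value of the key 'age'."
    tests = textwrap.dedent("""
        people = [{"age": 30}, {"age": 20}]
        assert sort_by_age(people) == [{"age": 20}, {"age": 30}]
    """)
    good = "def sort_by_age(rows):\n    return sorted(rows, key=lambda r: r['age'])"
    bad = "def sort_by_age(rows):\n    return rows"  # returns the list unsorted
    print(execution_accuracy([(good, tests), (bad, tests)]))  # prints 0.5
```

The toy query uses only built-ins; for open-domain samples, the interpreter running the check would also need whichever third-party libraries the query imports.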

arXiv information

Authors	Zhiruo Wang, Shuyan Zhou, Daniel Fried, Graham Neubig
Published	2023-05-19 14:27:46+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
