Multi-lingual Evaluation of Code Generation Models

Summary

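We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover more than 10 programming languages and are generated with a scalable conversion framework that transpiles prompts and test cases from the original Python datasets into the corresponding data in each target language.
Using these benchmarks, we evaluate code generation models across multiple languages and find that language models generalize to out-of-domain languages, that multilingual models have advantages over monolingual ones, that few-shot prompting can teach a model new languages, and that zero-shot translation ability appears even in monolingual settings.
Furthermore, we use our code generation model to perform large-scale bootstrapping and obtain synthetic canonical solutions in several languages, which can be used for other code-related evaluations such as code insertion, robustness, and summarization tasks.
Overall, our benchmarks represent a significant step toward a deeper understanding of the code generation abilities of language models.
We publicly release our code and datasets at https://github.com/amazon-research/mxeval.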
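
To make the idea of transpiling test cases more concrete, here is a minimal sketch in Python. It is not the conversion framework released with the benchmarks; it only assumes that a Python test case has the simple form `assert f(args) == expected` with literal arguments, and that the target language is JavaScript with camelCase function names. The helper names `snake_to_camel` and `python_assert_to_js` are hypothetical and chosen for illustration only.

```python
import ast
import json


def snake_to_camel(name: str) -> str:
    """Convert a snake_case identifier to camelCase (assumed JS naming style)."""
    head, *rest = name.split("_")
    return head + "".join(part[:1].upper() + part[1:] for part in rest)


def python_assert_to_js(assert_src: str) -> str:
    """Translate `assert f(args...) == expected` into a JavaScript check.

    Hypothetical helper: only handles a single equality comparison whose
    arguments and expected value are Python literals.
    """
    node = ast.parse(assert_src).body[0]
    if not isinstance(node, ast.Assert):
        raise ValueError("expected an assert statement")
    test = node.test
    is_simple_eq = (
        isinstance(test, ast.Compare)
        and len(test.ops) == 1
        and isinstance(test.ops[0], ast.Eq)
        and isinstance(test.left, ast.Call)
        and isinstance(test.left.func, ast.Name)
    )
    if not is_simple_eq:
        raise ValueError("only `assert f(...) == literal` is supported")

    call = test.left
    # Evaluate each literal safely, then serialize it as JSON,
    # which is also valid JavaScript literal syntax.
    js_args = ", ".join(json.dumps(ast.literal_eval(arg)) for arg in call.args)
    js_expected = json.dumps(ast.literal_eval(test.comparators[0]))
    js_name = snake_to_camel(call.func.id)
    # Compare serialized values so arrays and objects are compared by content.
    return (
        f"console.assert(JSON.stringify({js_name}({js_args})) "
        f"=== JSON.stringify({js_expected}));"
    )


if __name__ == "__main__":
    print(python_assert_to_js("assert min_cost([1, 2], 2) == 3"))
    # prints: console.assert(JSON.stringify(minCost([1, 2], 2)) === JSON.stringify(3));
```

A real converter would also need to map types, handle floating-point comparison, and emit a full test harness for each target language; this sketch skips all of that.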

Abstract (original)

We present new benchmarks on evaluation code generation models: MBXP and Multilingual HumanEval, and MathQA-X. These datasets cover over 10 programming languages and are generated using a scalable conversion framework that transpiles prompts and test cases from the original Python datasets into the corresponding data in the target language. Using these benchmarks, we are able to assess the performance of code generation models in a multi-lingual fashion, and discovered generalization ability of language models on out-of-domain languages, advantages of multi-lingual models over mono-lingual, the ability of few-shot prompting to teach the model new languages, and zero-shot translation abilities even on mono-lingual settings. Furthermore, we use our code generation model to perform large-scale bootstrapping to obtain synthetic canonical solutions in several languages, which can be used for other code-related evaluations such as code insertion, robustness, or summarization tasks. Overall, our benchmarks represents a significant step towards a deeper understanding of language models’ code generation abilities. We publicly release our code and datasets at https://github.com/amazon-research/mxeval.

arXiv information

Authors	Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, Sujan Kumar Gonugondla, Hantian Ding, Varun Kumar, Nathan Fulton, Arash Farahani, Siddhartha Jain, Robert Giaquinto, Haifeng Qian, Murali Krishna Ramanathan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta, Dan Roth, Bing Xiang
Published	2023-03-22 18:37:20+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
