L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models

Abstract

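Recently, large language models (LLMs), especially those pretrained on code, have shown a strong ability to generate programs from natural language inputs in few-shot and even zero-shot settings.
Despite these promising results, a comprehensive evaluation of these models' language-to-code generation capabilities is notably lacking.
Existing studies often focus on specific tasks, model architectures, or learning paradigms, which leads to a fragmented understanding of the overall landscape.
In this work, we present L2CEval, a systematic evaluation of the language-to-code generation capabilities of LLMs on 7 tasks spanning semantic parsing, math reasoning, and Python programming, and we analyze factors that may affect performance: model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure the models' confidence calibration and conduct human evaluations of the output programs.
This allows us to identify and analyze typical failure modes across tasks and models.
L2CEval offers a comprehensive understanding of the capabilities and limitations of LLMs in language-to-code generation.
We also release the evaluation framework and all model outputs, hoping to lay the groundwork for future research in this area.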

Abstract (original)

Recently, large language models (LLMs), especially those that are pretrained on code, have demonstrated strong capabilities in generating programs from natural language inputs in a few-shot or even zero-shot manner. Despite promising results, there is a notable lack of a comprehensive evaluation of these models' language-to-code generation capabilities. Existing studies often focus on specific tasks, model architectures, or learning paradigms, leading to a fragmented understanding of the overall landscape. In this work, we present L2CEval, a systematic evaluation of the language-to-code generation capabilities of LLMs on 7 tasks across the domain spectrum of semantic parsing, math reasoning and Python programming, analyzing the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods. In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs. This enables us to identify and analyze the typical failure modes across various tasks and models. L2CEval offers a comprehensive understanding of the capabilities and limitations of LLMs in language-to-code generation. We also release the evaluation framework and all model outputs, hoping to lay the groundwork for further future research in this domain.

arXiv information

Authors	Ansong Ni, Pengcheng Yin, Yilun Zhao, Martin Riddell, Troy Feng, Rui Shen, Stephen Yin, Ye Liu, Semih Yavuz, Caiming Xiong, Shafiq Joty, Yingbo Zhou, Dragomir Radev, Arman Cohan
Published	2023-09-29 17:57:00+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
