CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models

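Abstract

The emergence of large language models (LLMs) has substantially improved the programming capabilities of models and is attracting growing attention from researchers.
We propose CodeApex, a bilingual benchmark dataset focused on the programming comprehension and code generation abilities of LLMs.
CodeApex comprises three types of multiple-choice questions (conceptual understanding, commonsense reasoning, and multi-hop reasoning), designed to evaluate LLMs on programming comprehension tasks.
In addition, CodeApex uses algorithmic questions and corresponding test cases to assess the quality of the code that LLMs generate.
We evaluate 14 state-of-the-art LLMs, including both general-purpose and specialized models.
GPT shows the best programming capabilities, achieving accuracies of approximately 50% and 56% on the two tasks, respectively.
There is still significant room for improvement on programming tasks.
We hope that CodeApex can serve as a reference for evaluating the coding capabilities of LLMs and further promote their development and growth.
The datasets are released at https://github.com/APEXLAB/CodeApex.git.
The CodeApex submission website is https://apex.sjtu.edu.cn/codeapex/.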

Abstract (original)

With the emergence of Large Language Models (LLMs), there has been a significant improvement in the programming capabilities of models, attracting growing attention from researchers. We propose CodeApex, a bilingual benchmark dataset focusing on the programming comprehension and code generation abilities of LLMs. CodeApex comprises three types of multiple-choice questions: conceptual understanding, commonsense reasoning, and multi-hop reasoning, designed to evaluate LLMs on programming comprehension tasks. Additionally, CodeApex utilizes algorithmic questions and corresponding test cases to assess the code quality generated by LLMs. We evaluate 14 state-of-the-art LLMs, including both general-purpose and specialized models. GPT exhibits the best programming capabilities, achieving approximate accuracies of 50% and 56% on the two tasks, respectively. There is still significant room for improvement in programming tasks. We hope that CodeApex can serve as a reference for evaluating the coding capabilities of LLMs, further promoting their development and growth. Datasets are released at https://github.com/APEXLAB/CodeApex.git. CodeApex submission website is https://apex.sjtu.edu.cn/codeapex/.
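The two kinds of evaluation described in the abstract (multiple-choice accuracy for programming comprehension, and test-case checking for code generation) could be scored roughly as follows. This is only a minimal sketch under stated assumptions: the abstract does not specify the scoring harness, so the function names, the representation of a candidate solution as a Python callable, and the exact-match comparison of outputs are all hypothetical and are not taken from the CodeApex repository.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical representation: a test case is an (input text, expected output) pair.
TestCase = Tuple[str, str]


def multiple_choice_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions whose predicted option letter matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(
        p.strip().upper() == a.strip().upper() for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


def passes_all_tests(solution: Callable[[str], str], test_cases: Sequence[TestCase]) -> bool:
    """True only if the candidate produces the expected output on every test case."""
    for input_text, expected in test_cases:
        try:
            actual = solution(input_text)
        except Exception:
            return False  # assumption: a crash counts as a failed test
        if actual.strip() != expected.strip():
            return False
    return True


def code_generation_accuracy(
    solutions: Sequence[Callable[[str], str]],
    test_suites: Sequence[Sequence[TestCase]],
) -> float:
    """Fraction of questions whose generated solution passes its whole test suite."""
    if len(solutions) != len(test_suites):
        raise ValueError("each solution needs exactly one test suite")
    if not solutions:
        return 0.0
    passed = sum(passes_all_tests(s, t) for s, t in zip(solutions, test_suites))
    return passed / len(solutions)


# Toy usage: one "add two integers" question, answered correctly and incorrectly.
add_tests = [("1 2", "3"), ("10 -4", "6")]
good = lambda text: str(sum(int(tok) for tok in text.split()))
bad = lambda text: text.split()[0]  # echoes the first number only

mc_score = multiple_choice_accuracy(["A", "b", "C", "D"], ["A", "B", "D", "D"])
gen_score = code_generation_accuracy([good, bad], [add_tests, add_tests])
```

A real harness would run generated programs in a sandboxed subprocess with time and memory limits rather than calling them in-process; the in-process callables here only keep the sketch self-contained.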

arXiv information

Authors	Lingyue Fu, Huacan Chai, Shuang Luo, Kounianhua Du, Weiming Zhang, Longteng Fan, Jiayi Lei, Renting Rui, Jianghao Lin, Yuchen Fang, Yifan Liu, Jingkuan Wang, Siyuan Qi, Kangning Zhang, Weinan Zhang, Yong Yu
Published	2023-09-06 15:36:11+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
