HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation

Summary

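This paper introduces self-invoking code generation, a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs.
In this task, a model is given a base problem and a related, more complex problem.
It must solve the base problem and then use that solution to solve the more complex one.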
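
To make the task format concrete, here is a minimal sketch of a base problem paired with a self-invoking problem. Both problems and the function names `count_vowels` and `rank_by_vowels` are hypothetical illustrations and are not taken from HumanEval Pro, MBPP Pro, or BigCodeBench-Lite Pro.

```python
# Hypothetical illustration of the task format.
# Neither problem comes from the actual benchmarks.


def count_vowels(text: str) -> int:
    """Base problem: count the vowels in a single string."""
    return sum(1 for ch in text.lower() if ch in "aeiou")


def rank_by_vowels(words: list[str]) -> list[str]:
    """Self-invoking problem: sort words by vowel count, most vowels first.

    The solution calls the base solution instead of re-deriving it.
    Ties keep their original order because sorted() is stable.
    """
    return sorted(words, key=count_vowels, reverse=True)


if __name__ == "__main__":
    print(count_vowels("HumanEval"))                 # 4
    print(rank_by_vowels(["sky", "queue", "code"]))  # ['queue', 'code', 'sky']
```

A model is expected to produce both functions, and the second one is only correct if the first one is correct and is used properly.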
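The work makes three main contributions.
First, it proposes a general recipe for generating harder versions of existing benchmarks, which yields three new benchmarks: HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro. All three are designed specifically to evaluate LLMs on self-invoking code generation.
Second, an analysis of experimental results for more than twenty LLMs on these benchmarks leads to two observations. (i) Most LLMs excel on traditional code generation benchmarks such as HumanEval and MBPP, but their performance declines on self-invoking tasks.
For example, o1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro.
(ii) On self-invoking code generation, instruction-tuned models show only marginal improvements over their base models.
Third, it reports the types of failure modes found in the evaluation results.
Together, these results underscore the need for further progress on self-invoking code generation and suggest a new direction for future research on strengthening LLMs' code reasoning capabilities.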
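
The size of the o1-mini gap quoted above can be worked out with a few lines of arithmetic. The sketch below assumes pass@1 is measured with one sampled solution per problem, so it equals the percentage of problems solved; the abstract does not state the sampling setup, and the helper names are hypothetical.

```python
def pass_at_1(results: list[bool]) -> float:
    """Percentage of problems whose single sampled solution passed all tests.

    Assumption: exactly one sample per problem.
    """
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(results) / len(results)


def absolute_drop(base_score: float, pro_score: float) -> float:
    """Drop in percentage points between two pass@1 scores."""
    return base_score - pro_score


def relative_drop(base_score: float, pro_score: float) -> float:
    """Drop expressed as a percentage of the original score."""
    return 100.0 * (base_score - pro_score) / base_score


if __name__ == "__main__":
    # Scores for o1-mini as reported in the abstract.
    humaneval, humaneval_pro = 96.2, 76.2
    print(round(absolute_drop(humaneval, humaneval_pro), 1))  # 20.0
    print(round(relative_drop(humaneval, humaneval_pro), 1))  # 20.8
```

Under these assumptions the reported scores correspond to a drop of 20.0 percentage points, or about 20.8% relative to the HumanEval score.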

Abstract (original)

We introduce self-invoking code generation, a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem. They must solve the base problem and then utilize its solution to address the more complex one. This work features three key contributions. First, we propose a general recipe for generating more challenging versions of existing benchmarks, resulting in three new benchmarks: HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro, specifically designed to assess LLMs on self-invoking code generation. Second, from the analysis of experimental results over twenty LLMs on our benchmarks, we have two important observations: (i) Most LLMs excel in traditional code generation benchmarks like HumanEval and MBPP, but their performance declines on self-invoking tasks. For example, o1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro. (ii) On self-invoking code generation task, the instruction-tuned models demonstrate only marginal improvements compared to the base models. Third, we disclose the types of failure modes that exist in our evaluation results. All these results underscore the need for further advancements in self-invoking code generation tasks and provide a new direction for future research on enhancing LLMs’ code reasoning capabilities.

arXiv information

Authors	Zhaojian Yu, Yilun Zhao, Arman Cohan, Xiao-Ping Zhang
Published	2024-12-30 18:58:58+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
