Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval

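Summary

Existing code generation benchmarks for large language models (LLMs), such as HumanEval and MBPP, are designed to study end-to-end performance: they feed a natural-language problem description as input and examine the code generated in a specific programming language.
However, evaluation scores obtained this way give little hint about where the bottleneck of code generation lies, that is, whether LLMs struggle with problem-solving or with language-coding.
To answer this question, we construct PseudoEval, a multilingual code generation benchmark that provides a solution written in pseudocode as input.
This makes it possible to isolate and identify the bottleneck of code generation in different programming languages.
Our study yields several interesting findings.
For example, we identify that the bottleneck for LLMs in Python programming is problem-solving, whereas in Rust they struggle relatively more with language-coding.
Our study also indicates that problem-solving capability may transfer across programming languages, while language-coding needs more language-specific effort, especially for undertrained programming languages.
Finally, we release the pipeline used to construct PseudoEval to make it easier to extend existing benchmarks.
PseudoEval is available at https://anonymous.4open.science/r/PseudocodeACL25-7B74.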

Abstract (original)

Existing code generation benchmarks for Large Language Models (LLMs) such as HumanEval and MBPP are designed to study LLMs’ end-to-end performance, where the benchmarks feed a problem description in natural language as input and examine the generated code in specific programming languages. However, the evaluation scores revealed in this way provide a little hint as to the bottleneck of the code generation — whether LLMs are struggling with their problem-solving capability or language-coding capability. To answer this question, we construct PseudoEval, a multilingual code generation benchmark that provides a solution written in pseudocode as input. By doing so, the bottleneck of code generation in various programming languages could be isolated and identified. Our study yields several interesting findings. For example, we identify that the bottleneck of LLMs in Python programming is problem-solving, while Rust is struggling relatively more in language-coding. Also, our study indicates that problem-solving capability may transfer across programming languages, while language-coding needs more language-specific effort, especially for undertrained programming languages. Finally, we release the pipeline of constructing PseudoEval to facilitate the extension to existing benchmarks. PseudoEval is available at: https://anonymous.4open.science/r/PseudocodeACL25-7B74.
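
The comparison described in the abstract can be illustrated with a minimal sketch. It assumes each problem has been attempted twice, once from the natural-language description and once from a pseudocode solution, and that a pass/fail flag is recorded for each attempt. The Outcome record, the function names, and the toy data are hypothetical and are not taken from the PseudoEval pipeline.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """Hypothetical per-problem record; not part of the PseudoEval release."""
    problem_id: str
    passed_from_description: bool  # code generated from the problem description
    passed_from_pseudocode: bool   # code generated from a pseudocode solution


def pass_rate(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def bottleneck_report(outcomes):
    """Compare the two input settings for one programming language."""
    end_to_end = pass_rate(o.passed_from_description for o in outcomes)
    coding_only = pass_rate(o.passed_from_pseudocode for o in outcomes)
    return {
        "end_to_end": end_to_end,
        "language_coding": coding_only,
        # Improvement gained when the solution idea is supplied as pseudocode.
        "problem_solving_gap": coding_only - end_to_end,
    }


# Toy data for illustration only; these are not results from the paper.
toy_outcomes = [
    Outcome("p1", True, True),
    Outcome("p2", False, True),
    Outcome("p3", True, True),
    Outcome("p4", False, False),
]
report = bottleneck_report(toy_outcomes)
print(report)
```

Under this reading, a large gap between the two pass rates would point toward problem-solving as the bottleneck, while a low pass rate even with pseudocode as input would point toward language-coding. How the paper itself scores and aggregates results may differ.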

arXiv information

Authors	Jiarong Wu, Songqiang Chen, Jialun Cao, Hau Ching Lo, Shing-Chi Cheung
Published	2025-02-26 14:08:17+00:00

Source and services used

arxiv.jp, Google
