CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts

Summary

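Large language models (LLMs) have achieved remarkable success on code generation tasks, powering applications such as code completion, debugging, and programming assistance.
However, existing benchmarks such as HumanEval, MBPP, and BigCodeBench evaluate LLMs primarily on English-only prompts, overlooking the real-world scenario in which multilingual developers use code-mixed language when interacting with LLMs.
To address this gap, we introduce CodeMixBench, a new benchmark designed to evaluate the robustness of LLMs on code generation from code-mixed prompts.
Built on BigCodeBench, CodeMixBench introduces controlled code-mixing (CMD) into the natural-language portions of prompts across three language pairs: Hinglish (Hindi-English), Spanish-English, and Chinese Pinyin-English.
We comprehensively evaluate a diverse set of open-source code generation models ranging from 1.5B to 15B parameters.
Our results show that code-mixed prompts consistently degrade Pass@1 performance relative to their English-only counterparts, and that the drop grows at higher CMD levels for smaller models.
CodeMixBench provides a realistic evaluation framework for studying multilingual code generation and highlights new challenges and directions for building robust code generation models that generalize well across diverse linguistic settings.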
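
The comparison described above rests on Pass@1. As a rough illustration, the sketch below uses the unbiased pass@k estimator commonly reported for HumanEval-style benchmarks and averages it over tasks. The abstract does not state the sampling setup, so using this estimator here is an assumption, and the per-task counts are hypothetical values made up for the example, not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples generated, c of them pass the tests."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Average Pass@1 over tasks; each entry is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


# Hypothetical per-task counts (NOT from the paper), same tasks under two prompt styles.
english_only = [(10, 6), (10, 3), (10, 10)]
code_mixed = [(10, 4), (10, 1), (10, 9)]

# Positive value means the code-mixed prompts scored lower.
drop = mean_pass_at_1(english_only) - mean_pass_at_1(code_mixed)
print(f"Pass@1 drop: {drop:.3f}")
```

For k = 1 the estimator reduces to c / n per task, so with a single greedy sample per task Pass@1 is simply the fraction of tasks solved.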

Abstract (original)

Large Language Models (LLMs) have achieved remarkable success in code generation tasks, powering various applications like code completion, debugging, and programming assistance. However, existing benchmarks such as HumanEval, MBPP, and BigCodeBench primarily evaluate LLMs on English-only prompts, overlooking the real-world scenario where multilingual developers often use code-mixed language while interacting with LLMs. To address this gap, we introduce CodeMixBench, a novel benchmark designed to evaluate the robustness of LLMs on code generation from code-mixed prompts. Built upon BigCodeBench, CodeMixBench introduces controlled code-mixing (CMD) into the natural language parts of prompts across three language pairs: Hinglish (Hindi-English), Spanish-English, and Chinese Pinyin-English. We comprehensively evaluate a diverse set of open-source code generation models ranging from 1.5B to 15B parameters. Our results show that code-mixed prompts consistently degrade Pass@1 performance compared to their English-only counterparts, with performance drops increasing under higher CMD levels for smaller models. CodeMixBench provides a realistic evaluation framework for studying multilingual code generation and highlights new challenges and directions for building robust code generation models that generalize well across diverse linguistic settings.

arXiv information

Authors	Manik Sheokand, Parth Sawant
Published	2025-05-08 08:55:32+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
