DebugBench: Evaluating Debugging Capability of Large Language Models

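Summary

Large language models (LLMs) have demonstrated exceptional coding capability.
However, their debugging capability, another critical component of programming proficiency, remains relatively unexplored.
Previous evaluations of LLM debugging ability have been significantly limited by the risk of data leakage, the scale of the dataset, and the variety of bugs tested.
To overcome these shortcomings, we introduce DebugBench, an LLM debugging benchmark consisting of 4,253 instances.
It covers four major bug categories and 18 minor types in C++, Java, and Python.
To construct DebugBench, we collect code snippets from the LeetCode community, implant bugs into the source data with GPT-4, and apply rigorous quality checks.
We evaluate two commercial models and three open-source models in a zero-shot scenario.
We find that (1) closed-source models such as GPT-4 still debug worse than humans, while open-source models such as Code Llama fail to attain any pass rate scores;
(2) the complexity of debugging fluctuates notably depending on the bug category;
(3) incorporating runtime feedback has a clear impact on debugging performance, but it is not always helpful.
As an extension, we also compare LLM debugging with code generation, revealing a strong correlation between the two for closed-source models.
These findings will benefit the development of LLMs in debugging.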
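
The summary reports results as pass rates that vary by bug category. As a rough illustration only, the sketch below shows one way such a per-category pass rate could be computed. It assumes that a fix counts as passing only when it satisfies every test case for its instance; the summary does not state this rule. The helper names (`passes_all_tests`, `pass_rate_by_category`, `toy_fixer`), the `solve` entry point, the category labels, and the toy instances are all hypothetical and are not taken from DebugBench.

```python
from collections import defaultdict


def passes_all_tests(source, tests, entry_point="solve"):
    """Return True if `source` defines `entry_point` and it passes every test.

    Assumption: an instance passes only if all of its tests pass.
    Any exception (syntax error, missing function, runtime error) is a failure.
    """
    namespace = {}
    try:
        exec(source, namespace)  # run the candidate fix in a fresh namespace
        func = namespace[entry_point]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        return False


def pass_rate_by_category(instances, propose_fix):
    """Compute the fraction of passing fixes for each bug category."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for inst in instances:
        total[inst["category"]] += 1
        fixed_source = propose_fix(inst["buggy_code"])
        if passes_all_tests(fixed_source, inst["tests"]):
            passed[inst["category"]] += 1
    return {category: passed[category] / total[category] for category in total}


# Toy data: hypothetical instances and category labels, for illustration only.
instances = [
    {
        "category": "category_a",
        "buggy_code": "def solve(a, b):\n    return a - b\n",
        "tests": [((1, 2), 3), ((0, 0), 0)],
    },
    {
        "category": "category_b",
        "buggy_code": "def solve(xs):\n    return max(xs[1:])\n",
        "tests": [(([3, 1, 2],), 3)],
    },
]


def toy_fixer(buggy_code):
    """Stand-in for a model call: repairs only the first toy bug."""
    return buggy_code.replace("a - b", "a + b")


rates = pass_rate_by_category(instances, toy_fixer)
print(rates)
```

In a real evaluation, `toy_fixer` would be replaced by a zero-shot model call, and untrusted model output would need to run in a sandbox rather than through a bare `exec`.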

Summary (original)

Large Language Models (LLMs) have demonstrated exceptional coding capability. However, as another critical component of programming proficiency, the debugging capability of LLMs remains relatively unexplored. Previous evaluations of LLMs’ debugging ability are significantly limited by the risk of data leakage, the scale of the dataset, and the variety of tested bugs. To overcome these deficiencies, we introduce ‘DebugBench’, an LLM debugging benchmark consisting of 4,253 instances. It covers four major bug categories and 18 minor types in C++, Java, and Python. To construct DebugBench, we collect code snippets from the LeetCode community, implant bugs into source data with GPT-4, and assure rigorous quality checks. We evaluate two commercial and three open-source models in a zero-shot scenario. We find that (1) while closed-source models like GPT-4 exhibit inferior debugging performance compared to humans, open-source models such as Code Llama fail to attain any pass rate scores; (2) the complexity of debugging notably fluctuates depending on the bug category; (3) incorporating runtime feedback has a clear impact on debugging performance which is not always helpful. As an extension, we also compare LLM debugging and code generation, revealing a strong correlation between them for closed-source models. These findings will benefit the development of LLMs in debugging.

arXiv information

Authors	Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Zhiyuan Liu, Maosong Sun
Published	2024-01-09 15:46:38+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
