LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

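Abstract

Large language models (LLMs) applied to code-related tasks have emerged as a prominent field, attracting significant interest from both academia and industry.
However, as new and improved LLMs are developed, existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities.
In this work, we propose LiveCodeBench, a comprehensive and contamination-free evaluation of LLMs for code that continuously collects new problems over time from contests on three competition platforms: LeetCode, AtCoder, and CodeForces.
Notably, our benchmark goes beyond code generation to cover a broader range of code-related capabilities, such as self-repair, code execution, and test output prediction.
LiveCodeBench currently hosts 400 high-quality coding problems published between May 2023 and May 2024. We evaluated 18 base LLMs and 34 instruction-tuned LLMs on LiveCodeBench.
We present empirical findings on contamination, holistic performance comparisons, potential overfitting on existing benchmarks, and comparisons of individual models.
We will release all prompts and model completions for further community analysis, along with a general toolkit for adding new scenarios and models.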
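
One way to read "contamination-free" here is that a model should only be scored on problems published after its training data was collected. The following is a minimal sketch of that idea, not the benchmark's actual code: the `Problem` record, the `released_after` helper, the problem titles, and the cutoff date are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    """Hypothetical record for one contest problem (not the benchmark's schema)."""
    platform: str
    title: str
    released: date


def released_after(problems, cutoff):
    """Keep only problems published strictly after the assumed training cutoff."""
    return [p for p in problems if p.released > cutoff]


# Made-up problems spanning the May 2023 to May 2024 window mentioned in the abstract.
problems = [
    Problem("LeetCode", "example-a", date(2023, 5, 14)),
    Problem("AtCoder", "example-b", date(2023, 11, 4)),
    Problem("CodeForces", "example-c", date(2024, 3, 30)),
]

# Assumed cutoff for an imaginary model; a real cutoff must come from the model's documentation.
cutoff = date(2023, 9, 1)
fresh = released_after(problems, cutoff)
print([p.title for p in fresh])
```

Because problems keep arriving over time, the same filter can be rerun with a later cutoff whenever a newer model is evaluated.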

Abstract (original)

Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities. In this work, we propose LiveCodeBench, a comprehensive and contamination-free evaluation of LLMs for code, which continuously collects new problems over time from contests across three competition platforms, namely LeetCode, AtCoder, and CodeForces. Notably, our benchmark also focuses on a broader range of code-related capabilities, such as self-repair, code execution, and test output prediction, beyond just code generation. Currently, LiveCodeBench hosts four hundred high-quality coding problems that were published between May 2023 and May 2024. We have evaluated 18 base LLMs and 34 instruction-tuned LLMs on LiveCodeBench. We present empirical findings on contamination, holistic performance comparisons, potential overfitting in existing benchmarks as well as individual model comparisons. We will release all prompts and model completions for further community analysis, along with a general toolkit for adding new scenarios and models.

arXiv information

Authors	Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, Ion Stoica
Published	2024-06-06 17:41:21+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
