LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?

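Summary

Recent reports claim that large language models (LLMs) now outperform elite humans in competitive programming.
Drawing on the knowledge of a group of medalists from international algorithmic contests, we revisit this claim, examining how LLMs differ from human experts and where limitations still remain.
LiveCodeBench Pro is a benchmark composed of problems from Codeforces, ICPC, and IOI, continuously updated to reduce the likelihood of data contamination.
A team of Olympiad medalists annotates every problem with algorithmic categories and conducts a line-by-line analysis of failed model-generated submissions.
Using this new data and benchmark, we find that frontier models still have significant limitations: without external tools, the best model achieves only 53% pass@1 on medium-difficulty problems and 0% on hard problems, domains where expert humans still excel.
We also find that LLMs succeed at implementation-heavy problems but struggle with nuanced algorithmic reasoning and complex case analysis, often generating confidently incorrect justifications.
High performance appears to be driven largely by implementation precision and tool augmentation, not by superior reasoning.
LiveCodeBench Pro thus highlights the significant gap to human grandmaster level and offers fine-grained diagnostics to steer future improvements in code-centric LLM reasoning.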
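The summary reports results as pass@1 but does not say how the metric is computed. As a rough illustration only, the sketch below uses the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), which reduces to c/n when k = 1. The function names and the sample data are hypothetical, and nothing here is confirmed to match the benchmark's own evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of submissions sampled for the problem
    c: number of those submissions that passed all tests
    k: budget of attempts being scored (k = 1 gives pass@1)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains at least one pass.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical data: four problems, one submission each, two solved.
    sample = [(1, 1), (1, 0), (1, 1), (1, 0)]
    print(mean_pass_at_k(sample, k=1))
```

With a single submission per problem, pass@1 is simply the fraction of problems solved on the first attempt, which is how a figure such as 53% on medium-difficulty problems would typically be read.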

Summary (Original)

Recent reports claim that large language models (LLMs) now outperform elite humans in competitive programming. Drawing on knowledge from a group of medalists in international algorithmic contests, we revisit this claim, examining how LLMs differ from human experts and where limitations still remain. We introduce LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI that are continuously updated to reduce the likelihood of data contamination. A team of Olympiad medalists annotates every problem for algorithmic categories and conducts a line-by-line analysis of failed model-generated submissions. Using this new data and benchmark, we find that frontier models still have significant limitations: without external tools, the best model achieves only 53% pass@1 on medium-difficulty problems and 0% on hard problems, domains where expert humans still excel. We also find that LLMs succeed at implementation-heavy problems but struggle with nuanced algorithmic reasoning and complex case analysis, often generating confidently incorrect justifications. High performance appears largely driven by implementation precision and tool augmentation, not superior reasoning. LiveCodeBench Pro thus highlights the significant gap to human grandmaster levels, while offering fine-grained diagnostics to steer future improvements in code-centric LLM reasoning.

arXiv Information

Authors	Zihan Zheng, Zerui Cheng, Zeyu Shen, Shang Zhou, Kaiyuan Liu, Hansen He, Dongruixuan Li, Stanley Wei, Hangyi Hao, Jianzhu Yao, Peiyao Sheng, Zixuan Wang, Wenhao Chai, Aleksandra Korolova, Peter Henderson, Sanjeev Arora, Pramod Viswanath, Jingbo Shang, Saining Xie
Published	2025-06-13 16:29:09+00:00
arXiv site	arxiv_id(pdf)

Source and Services Used

arxiv.jp, Google
