SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving

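Summary

We introduce SwingArena, a competitive evaluation framework for large language models (LLMs) that closely reflects real-world software development workflows.
Unlike traditional static benchmarks, SwingArena models the collaborative process of software iteration by pairing LLMs as submitters, which generate patches, and reviewers, which write test cases and validate the patches through continuous integration (CI) pipelines.
To support these interactive evaluations, we introduce a retrieval-augmented code generation (RACG) module that handles long-context challenges efficiently by supplying syntactically and semantically relevant code snippets from large codebases, with support for multiple programming languages (C++, Python, Rust, and Go).
This lets the framework scale across diverse tasks and contexts while respecting token limits.
Using more than 400 high-quality real-world GitHub issues selected from a pool of 2,300, our experiments show that models such as GPT-4o excel at aggressive patch generation, whereas DeepSeek and Gemini prioritize correctness in CI validation.
SwingArena offers a scalable and extensible methodology for evaluating LLMs in realistic, CI-driven software development settings.
For more details, see the project page: swing-bench.github.io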

Abstract (original)

We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs) that closely mirrors real-world software development workflows. Unlike traditional static benchmarks, SwingArena models the collaborative process of software iteration by pairing LLMs as submitters, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines. To support these interactive evaluations, we introduce a retrieval-augmented code generation (RACG) module that efficiently handles long-context challenges by providing syntactically and semantically relevant code snippets from large codebases, supporting multiple programming languages (C++, Python, Rust, and Go). This enables the framework to scale across diverse tasks and contexts while respecting token limitations. Our experiments, using over 400 high-quality real-world GitHub issues selected from a pool of 2,300 issues, show that models like GPT-4o excel at aggressive patch generation, whereas DeepSeek and Gemini prioritize correctness in CI validation. SwingArena presents a scalable and extensible methodology for evaluating LLMs in realistic, CI-driven software development settings. More details are available on our project page: swing-bench.github.io

arXiv information

Authors	Wendong Xu, Jing Xiong, Chenyang Zhao, Qiujiang Chen, Haoran Wang, Hui Shen, Zhongwei Wan, Jianbo Dai, Taiqiang Wu, He Xiao, Chaofan Tao, Z. Morley Mao, Ying Sheng, Zhijiang Guo, Hongxia Yang, Bei Yu, Lingpeng Kong, Quanquan Gu, Ngai Wong
Published	2025-06-02 17:42:36+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
