Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

Summary

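Issue resolving is the task of modifying a codebase to produce a patch that addresses a given issue. Existing benchmarks such as SWE-bench, however, focus almost exclusively on Python, which makes them insufficient for evaluating large language models (LLMs) across diverse software ecosystems. To address this, we introduce Multi-SWE-bench, a multilingual issue-resolving benchmark covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. Multi-SWE-bench contains 1,632 high-quality instances, carefully annotated from 2,456 candidates by 68 expert annotators, so that the benchmark provides accurate and reliable evaluation. Using Multi-SWE-bench, we evaluate a series of state-of-the-art models with three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. We also launch Multi-SWE-RL, an open-source community aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source the entire data production pipeline together with detailed tutorials, encouraging the community to keep contributing and expanding the dataset. We envision Multi-SWE-bench and the growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.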

Summary (original)

The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release a set of 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute and expand the dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.

arXiv information

Authors	Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, Liang Xiang
Published	2025-04-03 14:06:17+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
