SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

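Abstract

Language models have outpaced our ability to evaluate them effectively, yet studying the frontier of their capabilities is essential for their future development.
We consider real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models.
We therefore introduce SWE-bench, an evaluation framework comprising 2,294 software engineering problems drawn from real GitHub issues and their corresponding pull requests across 12 popular Python repositories.
Given a codebase together with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue.
Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, which calls for models to interact with execution environments, process extremely long contexts, and perform complex reasoning that goes far beyond traditional code generation.
Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues.
Claude 2 and GPT-4 solve a mere 4.8% and 1.7% of instances respectively, even when provided with an oracle retriever.
Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.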

Abstract (original)

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We consider real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. We therefore introduce SWE-bench, an evaluation framework including 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. Claude 2 and GPT-4 solve a mere 4.8% and 1.7% of instances respectively, even when provided with an oracle retriever. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.
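The task setup and the headline percentages can be pictured with a small sketch. The code below is illustrative only and is not the benchmark's actual harness: the `TaskInstance` fields, the `resolution_rate` helper, and the example outcomes are all hypothetical names and values chosen for this page. It assumes each instance ends up with a single resolved or unresolved verdict and reports the share of resolved instances as a percentage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """Hypothetical record for one task: a codebase plus an issue to resolve."""
    instance_id: str   # illustrative identifier, not a real benchmark ID
    repo: str          # repository the issue belongs to
    issue_text: str    # description of the issue the model must address


def resolution_rate(outcomes: dict[str, bool]) -> float:
    """Return the percentage of instances whose verdict is 'resolved' (True)."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return 100.0 * sum(outcomes.values()) / len(outcomes)


# Made-up verdicts for four hypothetical instances (not real results).
example_outcomes = {"task-1": True, "task-2": False, "task-3": False, "task-4": False}
example_rate = resolution_rate(example_outcomes)
```

Under these assumptions, a reported figure such as 4.8% is simply this ratio taken over the set of instances a model was evaluated on.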

arXiv information

Authors	Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan
Published	2023-10-10 16:47:29+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
