GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents

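Summary

Developing high-performance software is a complex task that requires specialized expertise.
We introduce GSO, a benchmark for evaluating the ability of language models to develop high-performance software.
We built an automated pipeline that generates and runs performance tests and analyzes repository commit histories, identifying 102 challenging optimization tasks across 10 codebases that span diverse domains and programming languages.
An agent receives a codebase and a performance test as a precise specification and is tasked with improving runtime efficiency, which is measured against the expert developer's optimization.
Our quantitative evaluation shows that leading SWE-Agents struggle significantly, achieving a success rate below 5%, with only limited improvement even under inference-time scaling.
Our qualitative analysis identifies key failure modes, including difficulty with low-level languages, reliance on lazy optimization strategies, and trouble localizing bottlenecks accurately.
We release the benchmark's code and artifacts, along with agent trajectories, to enable future research.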

Abstract (original)

Developing high-performance software is a complex task that requires specialized expertise. We introduce GSO, a benchmark for evaluating language models’ capabilities in developing high-performance software. We develop an automated pipeline that generates and executes performance tests to analyze repository commit histories to identify 102 challenging optimization tasks across 10 codebases, spanning diverse domains and programming languages. An agent is provided with a codebase and performance test as a precise specification, and tasked to improve the runtime efficiency, which is measured against the expert developer optimization. Our quantitative evaluation reveals that leading SWE-Agents struggle significantly, achieving less than 5% success rate, with limited improvements even with inference-time scaling. Our qualitative analysis identifies key failure modes, including difficulties with low-level languages, practicing lazy optimization strategies, and challenges in accurately localizing bottlenecks. We release the code and artifacts of our benchmark along with agent trajectories to enable future research.
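The abstract says an agent's change is judged by runtime on a performance test, relative to the expert developer's optimization. The sketch below illustrates one way such a comparison could be expressed. It is not the benchmark's actual harness or metric: the function names `measure_runtime` and `matches_expert`, the use of the median over repeated runs, and the 0.95 threshold are all assumptions made for illustration.

```python
import time
from statistics import median
from typing import Callable


def measure_runtime(workload: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock time (seconds) of `workload` over `repeats` runs.

    Hypothetical helper: the real benchmark's timing procedure is not
    described in the abstract.
    """
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return median(samples)


def matches_expert(
    baseline_s: float,
    agent_s: float,
    expert_s: float,
    threshold: float = 0.95,  # assumed value, not taken from the source
) -> bool:
    """Check whether the agent's speedup reaches a fraction of the expert's speedup.

    Speedup is the unoptimized (baseline) runtime divided by the
    optimized runtime, so larger is better.
    """
    agent_speedup = baseline_s / agent_s
    expert_speedup = baseline_s / expert_s
    return agent_speedup >= threshold * expert_speedup
```

For example, with a 10 s baseline and an expert patch that runs in 2 s, an agent patch that also runs in 2 s passes this check, while one that runs in 5 s does not. A real harness would also need to confirm that the optimized code still produces correct results, which this sketch leaves out.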

arXiv information

Authors	Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, Ion Stoica
Published	2025-05-29 17:14:55+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
