GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git

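Summary

Benchmarks for Software Engineering (SE) AI agents, most notably SWE-bench, have driven progress in the programming capabilities of AI agents.
However, they overlook critical developer workflows such as Version Control System (VCS) operations.
To address this gap, we present GitGoodBench, a new benchmark for evaluating AI agent performance on VCS tasks.
GitGoodBench covers three core Git scenarios extracted from permissively licensed open-source Python, Java, and Kotlin repositories.
The benchmark provides three datasets: a comprehensive evaluation suite (900 samples), a rapid prototyping version (120 samples), and a training corpus (17,469 samples).
Using GPT-4o equipped with custom tools, we establish baseline performance on the prototyping version of the benchmark, achieving an overall solve rate of 21.11%.
We expect GitGoodBench to serve as a crucial stepping stone toward truly comprehensive SE agents that go beyond mere programming.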

Summary (original)

Benchmarks for Software Engineering (SE) AI agents, most notably SWE-bench, have catalyzed progress in programming capabilities of AI agents. However, they overlook critical developer workflows such as Version Control System (VCS) operations. To address this issue, we present GitGoodBench, a novel benchmark for evaluating AI agent performance on VCS tasks. GitGoodBench covers three core Git scenarios extracted from permissive open-source Python, Java, and Kotlin repositories. Our benchmark provides three datasets: a comprehensive evaluation suite (900 samples), a rapid prototyping version (120 samples), and a training corpus (17,469 samples). We establish baseline performance on the prototyping version of our benchmark using GPT-4o equipped with custom tools, achieving a 21.11% solve rate overall. We expect GitGoodBench to serve as a crucial stepping stone toward truly comprehensive SE agents that go beyond mere programming.

arXiv information

Authors	Tobias Lindenbauer, Egor Bogomolov, Yaroslav Zharov
Published	2025-05-28 16:56:11+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
