PaperBench: Evaluating AI’s Ability to Replicate AI Research

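Summary

We introduce PaperBench, a benchmark that evaluates the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, which involves understanding each paper's contributions, developing a codebase, and successfully running experiments. For objective evaluation, we developed rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. The rubrics were co-developed with the authors of each ICML paper for accuracy and realism. To enable scalable evaluation, we also developed an LLM-based judge that automatically grades replication attempts against the rubrics, and we assess the judge's performance with a separate benchmark for judges. Evaluating several frontier models on PaperBench, we find that the best-performing agent tested, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruited top ML PhDs to attempt a subset of PaperBench and found that models do not yet outperform the human baseline. We open-source our code to facilitate future research into the AI engineering capabilities of AI agents.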

Abstract (original)

We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge’s performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code (https://github.com/openai/preparedness) to facilitate future research in understanding the AI engineering capabilities of AI agents.
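
To make the idea of a hierarchical rubric concrete, here is a minimal Python sketch of a rubric tree and one way a replication score could be rolled up from it. The abstract does not say how sub-task grades are combined, so everything here is an assumption for illustration: the `RubricNode` class, the `weight` field, the pass/fail leaves, and the weighted-average aggregation are hypothetical and not taken from the PaperBench code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One node of a hypothetical rubric tree (not the PaperBench schema)."""
    name: str
    weight: float = 1.0              # assumed importance relative to siblings
    passed: Optional[bool] = None    # judge's verdict; only meaningful on leaves
    children: List["RubricNode"] = field(default_factory=list)


def score(node: RubricNode) -> float:
    """Score in [0, 1]: leaves are pass/fail, parents take a weighted mean of children."""
    if not node.children:
        return 1.0 if node.passed else 0.0
    total_weight = sum(child.weight for child in node.children)
    return sum(child.weight * score(child) for child in node.children) / total_weight


def count_leaves(node: RubricNode) -> int:
    """Number of individually gradable sub-tasks under this node."""
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children)


# A toy rubric for a single paper; names and verdicts are made up.
root = RubricNode("replicate paper", children=[
    RubricNode("code development", weight=2.0, children=[
        RubricNode("implements proposed method", passed=True),
        RubricNode("implements baseline", passed=False),
    ]),
    RubricNode("experiments executed", weight=1.0, passed=True),
    RubricNode("results match paper", weight=1.0, passed=False),
])

print(f"gradable sub-tasks: {count_leaves(root)}")
print(f"replication score:  {score(root):.1%}")
```

Under these assumptions, a benchmark-level figure such as the reported average replication score would be the mean of the per-paper root scores.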

arXiv information

Authors	Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, Tejal Patwardhan
Published	2025-04-04 12:44:57+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
