PaperBench: Evaluating AI’s Ability to Replicate AI Research

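Summary

We introduce PaperBench, a benchmark that evaluates the ability of AI agents to replicate state-of-the-art AI research.
Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, which includes understanding each paper's contributions, developing a codebase, and successfully running experiments.
For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria.
In total, PaperBench contains 8,316 individually gradable tasks.
The rubrics are co-developed with the author(s) of each ICML paper to ensure accuracy and realism.
To enable scalable evaluation, we also develop an LLM-based judge that automatically grades replication attempts against the rubrics, and we assess the judge's performance by creating a separate benchmark for judges.
We evaluate several frontier models on PaperBench and find that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%.
Finally, we recruit top ML PhDs to attempt a subset of PaperBench and find that models do not yet outperform the human baseline.
We open-source our code (https://github.com/openai/preparedness) to facilitate future research into the AI engineering capabilities of AI agents.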
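
The hierarchical rubric described above can be pictured as a tree whose leaves are the individually gradable tasks. The following is a minimal sketch only, not the PaperBench implementation. The `RubricNode` class, the pass/fail leaf grades, the `weight` field, and the weighted-average aggregation are all assumptions made for illustration; the summary itself states only that rubrics decompose each task hierarchically into sub-tasks with clear grading criteria.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One node of a hypothetical rubric tree (names are illustrative)."""
    requirement: str
    weight: float = 1.0              # assumed: relative importance among siblings
    passed: Optional[bool] = None    # assumed: leaves are graded pass/fail
    children: List["RubricNode"] = field(default_factory=list)

    def score(self) -> float:
        """Leaf: 1.0 if passed, else 0.0. Parent: weighted mean of children."""
        if not self.children:
            return 1.0 if self.passed else 0.0
        total_weight = sum(child.weight for child in self.children)
        weighted_sum = sum(child.weight * child.score() for child in self.children)
        return weighted_sum / total_weight

    def count_leaves(self) -> int:
        """Number of individually gradable tasks under this node."""
        if not self.children:
            return 1
        return sum(child.count_leaves() for child in self.children)


# A toy rubric for one paper; requirements and grades are made up.
root = RubricNode(
    "Replicate the paper",
    children=[
        RubricNode(
            "Codebase developed",
            weight=2.0,
            children=[
                RubricNode("Model implemented", passed=True),
                RubricNode("Training loop implemented", passed=False),
            ],
        ),
        RubricNode("Experiments executed", passed=False),
        RubricNode("Reported results reproduced", passed=True),
    ],
)

print(root.count_leaves())  # 4
print(root.score())         # 0.5
```

In this toy tree the "Codebase developed" branch scores 0.5 and carries twice the weight of its siblings, so the root score is (2 × 0.5 + 0 + 1) / 4 = 0.5. Under this assumed scheme, a benchmark-level figure would be the mean of such root scores across papers.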

Summary (original)

We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge’s performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code (https://github.com/openai/preparedness) to facilitate future research in understanding the AI engineering capabilities of AI agents.

arXiv information

Authors	Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, Tejal Patwardhan
Published	2025-04-02 15:55:24+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
