Measuring AI Ability to Complete Long Tasks

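Summary

Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: the 50%-task-completion time horizon.
This is the time humans typically take to complete tasks that AI models can complete with a 50% success rate.
We first timed humans with relevant domain expertise on a combination of RE-Bench, HCAST, and 66 novel shorter tasks.
On these tasks, current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes.
Furthermore, the frontier AI time horizon has been doubling approximately every seven months since 2019, though the trend may have accelerated in 2024. The increase in AI models' time horizons seems to be driven primarily by greater reliability and ability to adapt to mistakes, combined with better logical reasoning and tool use capabilities.
We discuss the limitations of our results, including their degree of external validity, and the implications of increased autonomy for dangerous capabilities.
If these results generalize to real-world software tasks, extrapolation of this trend predicts that within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month.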
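
To make the metric concrete, here is a minimal sketch of one way a 50% time horizon could be estimated. It assumes success probability is modeled as a logistic function of log2(human completion time). The abstract does not specify the fitting procedure, so this is an illustrative assumption, not the authors' method. The task data (`TOY_TASKS`) and the function name `fit_time_horizon` are hypothetical.

```python
import math

# Hypothetical toy data, invented for illustration only:
# (human completion time in minutes, model successes, model attempts)
TOY_TASKS = [
    (1, 10, 10), (2, 10, 10), (4, 9, 10), (8, 9, 10), (16, 7, 10),
    (32, 5, 10), (64, 4, 10), (128, 2, 10), (256, 1, 10), (512, 0, 10),
]


def fit_time_horizon(tasks, iterations=25):
    """Fit P(success) = sigmoid(a + b * log2(minutes)) by Newton's method.

    Returns (a, b, horizon_minutes), where horizon_minutes is the human
    task length at which the fitted success probability equals 50%.
    """
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for minutes, successes, attempts in tasks:
            x = math.log2(minutes)
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            resid = successes - attempts * p   # log-likelihood gradient term
            w = attempts * p * (1.0 - p)       # curvature (Hessian) weight
            g0 += resid
            g1 += resid * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step: solve [[h00, h01], [h01, h11]] @ delta = [g0, g1]
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    # sigmoid(a + b * x) = 0.5 exactly when a + b * x = 0
    return a, b, 2.0 ** (-a / b)


a, b, horizon = fit_time_horizon(TOY_TASKS)
print(f"slope per doubling of task length: {b:.2f}")
print(f"estimated 50% time horizon: {horizon:.1f} minutes")
```

A negative slope `b` means success falls as tasks get longer, and the horizon is simply the point where the fitted curve crosses 50%.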

Abstract (original)

Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear. To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon. This is the time humans typically take to complete tasks that AI models can complete with 50% success rate. We first timed humans with relevant domain expertise on a combination of RE-Bench, HCAST, and 66 novel shorter tasks. On these tasks, current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes. Furthermore, frontier AI time horizon has been doubling approximately every seven months since 2019, though the trend may have accelerated in 2024. The increase in AI models’ time horizons seems to be primarily driven by greater reliability and ability to adapt to mistakes, combined with better logical reasoning and tool use capabilities. We discuss the limitations of our results — including their degree of external validity — and the implications of increased autonomy for dangerous capabilities. If these results generalize to real-world software tasks, extrapolation of this trend predicts that within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month.
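
The extrapolation in the final sentence can be checked with simple arithmetic. The sketch below assumes a constant seven-month doubling time starting from a 50-minute horizon, both taken from the abstract. The 167-hour work-month and the function names are assumptions introduced here for illustration.

```python
import math

CURRENT_HORIZON_MIN = 50.0   # from the abstract: about 50 minutes
DOUBLING_MONTHS = 7.0        # from the abstract: about seven months
WORK_MONTH_HOURS = 167.0     # assumption: roughly 40 h/week of human work


def extrapolate_horizon(current_minutes, doubling_months, months_ahead):
    """Project the time horizon forward under a constant doubling time."""
    return current_minutes * 2.0 ** (months_ahead / doubling_months)


def months_until(target_minutes, current_minutes, doubling_months):
    """Months needed for the horizon to reach target_minutes."""
    return doubling_months * math.log2(target_minutes / current_minutes)


projected_min = extrapolate_horizon(CURRENT_HORIZON_MIN, DOUBLING_MONTHS, 60)
print(f"horizon after 5 years: {projected_min / 60:.0f} hours")

months = months_until(WORK_MONTH_HOURS * 60, CURRENT_HORIZON_MIN, DOUBLING_MONTHS)
print(f"months until a one-work-month horizon: {months:.1f}")
```

Under these assumptions the five-year projection comes out at a few hundred hours, which is the same order of magnitude as a month of human work. The result is sensitive to the doubling time, which the abstract notes may have shortened in 2024.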

arXiv information

Authors	Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, Lawrence Chan
Published	2025-03-18 17:59:31+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
