HelpSteer3: Human-Annotated Feedback and Edit Data to Empower Inference-Time Scaling in Open-Ended General-Domain Tasks

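Summary

Inference-time scaling has been critical to the success of recent models such as OpenAI o1 and DeepSeek R1.
However, many techniques used to train models for inference-time scaling require tasks whose answers can be verified, which limits their application to domains such as math, coding, and logical reasoning.
We take inspiration from how humans, across a wide spectrum of open-ended endeavors, make a first attempt, ask others for detailed feedback, and improve based on that feedback.
To this end, we collect HelpSteer3 data to train dedicated Feedback and Edit Models capable of performing inference-time scaling on open-ended general-domain tasks.
In our setup, one model generates an initial response, a second model gives feedback on it, and a third model uses that feedback to edit the response.
We show that performance on Arena Hard, a benchmark strongly predictive of Chatbot Arena Elo, can be boosted by scaling the number of initial response drafts, effective feedback, and edited responses.
When scaled optimally, our setup based on 70B models from the Llama 3 family reaches SoTA performance on Arena Hard at 92.7 as of 5 Mar 2025, surpassing OpenAI o1-preview-2024-09-12 at 90.4 and DeepSeek R1 at 92.3.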
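
The generate, feedback, edit setup described in the summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the nested-loop fan-out, and the final selection through a `score` callable are all hypothetical, and the toy stand-ins replace what would be real language models. The abstract does not say how a final response is chosen.

```python
from typing import Callable, List

def collect_candidates(
    prompt: str,
    generate: Callable[[str, int], str],
    give_feedback: Callable[[str, str, int], str],
    edit: Callable[[str, str, str, int], str],
    num_drafts: int = 2,
    num_feedbacks: int = 2,
    num_edits: int = 2,
) -> List[str]:
    """Fan out drafts -> feedback -> edits and return every edited response."""
    candidates: List[str] = []
    for d in range(num_drafts):
        draft = generate(prompt, d)                      # model 1: initial response
        for f in range(num_feedbacks):
            feedback = give_feedback(prompt, draft, f)   # model 2: feedback on the draft
            for e in range(num_edits):
                # model 3: edit the draft using the feedback
                candidates.append(edit(prompt, draft, feedback, e))
    return candidates

def pick_best(prompt: str, candidates: List[str],
              score: Callable[[str, str], float]) -> str:
    """Assumed selection step: keep the highest-scoring edited response."""
    return max(candidates, key=lambda response: score(prompt, response))

# Toy stand-ins (hypothetical) so the sketch runs without any model.
def toy_generate(prompt: str, seed: int) -> str:
    return f"draft{seed}"

def toy_feedback(prompt: str, draft: str, seed: int) -> str:
    return f"fb{seed}"

def toy_edit(prompt: str, draft: str, feedback: str, seed: int) -> str:
    return f"{draft}+{feedback}+edit{seed}"

def toy_score(prompt: str, response: str) -> float:
    # Arbitrary toy score: sum of the digits appearing in the response.
    return float(sum(int(c) for c in response if c.isdigit()))

candidates = collect_candidates("question", toy_generate, toy_feedback, toy_edit)
best = pick_best("question", candidates, toy_score)
```

With the default settings the fan-out produces 2 × 2 × 2 edited candidates; raising any of the three counts is the scaling knob the summary refers to.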

Abstract (original)

Inference-Time Scaling has been critical to the success of recent models such as OpenAI o1 and DeepSeek R1. However, many techniques used to train models for inference-time scaling require tasks to have answers that can be verified, limiting their application to domains such as math, coding and logical reasoning. We take inspiration from how humans make first attempts, ask for detailed feedback from others and make improvements based on such feedback across a wide spectrum of open-ended endeavors. To this end, we collect HelpSteer3 data to train dedicated Feedback and Edit Models that are capable of performing inference-time scaling for open-ended general-domain tasks. In our setup, one model generates an initial response, which is given feedback by a second model; that feedback is then used by a third model to edit the response. We show that performance on Arena Hard, a benchmark strongly predictive of Chatbot Arena Elo, can be boosted by scaling the number of initial response drafts, effective feedback and edited responses. When scaled optimally, our setup based on 70B models from the Llama 3 family can reach SoTA performance on Arena Hard at 92.7 as of 5 Mar 2025, surpassing OpenAI o1-preview-2024-09-12 with 90.4 and DeepSeek R1 with 92.3.

arXiv information

Authors	Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Daniel Egert, Ellie Evans, Hoo-Chang Shin, Felipe Soares, Yi Dong, Oleksii Kuchaiev
Published	2025-05-30 16:42:57+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
