Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators

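Summary

As language model (LM) outputs become increasingly natural, evaluating their quality is harder than ever.
At the same time, increasing LMs' "thinking" time by scaling test-time compute has proven to be an effective technique for solving challenging problems in domains such as math and code.
This raises a natural question: can an LM's evaluation capability also be improved by spending more test-time compute?
To answer it, we investigate using reasoning models, that is, LMs that natively generate long chain-of-thought reasoning, as evaluators.
Specifically, we examine ways to leverage more test-time compute by (1) using reasoning models and (2) prompting these models to evaluate not only the response as a whole (outcome evaluation) but also each step of the response separately (process evaluation).
In experiments, evaluator performance improves monotonically as more reasoning tokens are generated, similar to the trend observed in LM-based generation.
Furthermore, by using these more accurate evaluators to rerank multiple generations, we demonstrate that spending more compute at evaluation time can be as effective as spending more compute at generation time for improving an LM's problem-solving capability.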

Abstract (original)

As language model (LM) outputs get more and more natural, it is becoming more difficult than ever to evaluate their quality. Simultaneously, increasing LMs’ ‘thinking’ time through scaling test-time compute has proven an effective technique to solve challenging problems in domains such as math and code. This raises a natural question: can an LM’s evaluation capability also be improved by spending more test-time compute? To answer this, we investigate employing reasoning models (LMs that natively generate long chain-of-thought reasoning) as evaluators. Specifically, we examine methods to leverage more test-time compute by (1) using reasoning models, and (2) prompting these models to evaluate not only the response as a whole (i.e., outcome evaluation) but also assess each step in the response separately (i.e., process evaluation). In experiments, we observe that the evaluator’s performance improves monotonically when generating more reasoning tokens, similar to the trends observed in LM-based generation. Furthermore, we use these more accurate evaluators to rerank multiple generations, and demonstrate that spending more compute at evaluation time can be as effective as using more compute at generation time in improving an LM’s problem-solving capability.
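The reranking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how step-level and response-level judgments are combined, so the `min` aggregation, the linear `process_weight` mix, and the names `score_step`, `score_outcome`, and `rerank` are all assumptions made for illustration. The two scoring callables stand in for calls to a reasoning model acting as evaluator.

```python
from typing import Callable, List, Sequence

# A response is modeled as a list of step strings (hypothetical format).
Response = List[str]


def process_score(step_scores: Sequence[float]) -> float:
    """Collapse per-step scores into a single process score.

    Using the minimum is an assumption: one bad step sinks the response.
    """
    if not step_scores:
        raise ValueError("response has no steps to score")
    return min(step_scores)


def combined_score(
    response: Response,
    score_step: Callable[[Response, int], float],
    score_outcome: Callable[[Response], float],
    process_weight: float = 0.5,
) -> float:
    """Mix process and outcome scores with an assumed linear weighting."""
    step_scores = [score_step(response, i) for i in range(len(response))]
    return (
        process_weight * process_score(step_scores)
        + (1.0 - process_weight) * score_outcome(response)
    )


def rerank(
    candidates: Sequence[Response],
    score_step: Callable[[Response, int], float],
    score_outcome: Callable[[Response], float],
    process_weight: float = 0.5,
) -> List[Response]:
    """Return the candidates sorted from best to worst combined score."""
    return sorted(
        candidates,
        key=lambda r: combined_score(r, score_step, score_outcome, process_weight),
        reverse=True,
    )
```

In use, `score_step(response, i)` would prompt the evaluator to judge step `i` given the steps before it, and `score_outcome(response)` would prompt it to judge the response as a whole; the first element of `rerank(...)` is then the selected generation. Spending more evaluation-time compute corresponds to making those two calls more expensive, for example by letting the evaluator generate more reasoning tokens.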

arXiv information

Authors	Seungone Kim, Ian Wu, Jinu Lee, Xiang Yue, Seongyun Lee, Mingyeong Moon, Kiril Gashteovski, Carolin Lawrence, Julia Hockenmaier, Graham Neubig, Sean Welleck
Published	2025-03-25 17:41:18+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
