SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling

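Abstract

Process, or step-wise, supervision has played a crucial role in advancing the complex multi-step reasoning capabilities of large language models (LLMs).
However, efficient, high-quality automated process annotation remains a significant challenge.
To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables single-pass, per-step annotation by aligning each solution step to one or more steps in a reference solution, accompanied by explicit reasoning for evaluation.
We show that reference-guided step-level evaluation effectively facilitates process supervision on four datasets spanning three domains: mathematical reasoning, multi-hop compositional question answering, and spatial reasoning.
Compared to baselines, SPARE improves reasoning performance when used for (1) fine-tuning models in an offline RL setup for greedy decoding at inference time, and (2) training reward models for ranking or aggregating multiple LLM-generated outputs.
Additionally, SPARE achieves competitive performance on challenging mathematical datasets while offering 2.6 times greater efficiency, requiring only 38% of the runtime of tree search-based automatic annotation.
The codebase, along with a trained SPARE-PRM model, is publicly released to facilitate further research and reproducibility.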
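
The per-step annotation described above can be pictured as a small record per solution step. The following is a minimal sketch only, not the released SPARE code: the `StepAnnotation` class, its field names, and the helper functions are hypothetical names chosen for illustration, and the binary correct/incorrect label is an assumption about how an annotation might be stored.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepAnnotation:
    """Hypothetical record for one annotated solution step (illustrative only)."""
    step_index: int             # position of the step in the model's solution
    reference_steps: List[int]  # indices of the aligned reference steps (one or more)
    reasoning: str              # explicit justification for the label
    is_correct: bool            # assumed binary step-level label


def first_error_step(annotations: List[StepAnnotation]) -> Optional[int]:
    """Return the index of the earliest step labelled incorrect, or None."""
    for ann in sorted(annotations, key=lambda a: a.step_index):
        if not ann.is_correct:
            return ann.step_index
    return None


def step_labels(annotations: List[StepAnnotation]) -> List[int]:
    """Convert annotations into 0/1 labels ordered by step index."""
    ordered = sorted(annotations, key=lambda a: a.step_index)
    return [1 if a.is_correct else 0 for a in ordered]


# Made-up example: the second step aligns to two reference steps.
example = [
    StepAnnotation(0, [0], "Matches the reference setup.", True),
    StepAnnotation(1, [1, 2], "Combines two reference steps correctly.", True),
    StepAnnotation(2, [3], "Arithmetic differs from the reference.", False),
]
```

Labels in this shape could serve as step-level training targets for a process reward model, though the actual label format used by SPARE may differ.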

Abstract (original)

Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs). However, efficient, high-quality automated process annotation remains a significant challenge. To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables single-pass, per-step annotation by aligning each solution step to one or multiple steps in a reference solution, accompanied by explicit reasoning for evaluation. We show that reference-guided step-level evaluation effectively facilitates process supervision on four datasets spanning three domains: mathematical reasoning, multi-hop compositional question answering, and spatial reasoning. We demonstrate that SPARE, when compared to baselines, improves reasoning performance when used for: (1) fine-tuning models in an offline RL setup for inference-time greedy-decoding, and (2) training reward models for ranking/aggregating multiple LLM-generated outputs. Additionally, SPARE achieves competitive performance on challenging mathematical datasets while offering 2.6 times greater efficiency, requiring only 38% of the runtime, compared to tree search-based automatic annotation. The codebase, along with a trained SPARE-PRM model, is publicly released to facilitate further research and reproducibility.
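
To make use case (2) concrete, here is a hedged sketch of how per-step reward-model scores might be used to rank or aggregate several sampled solutions. It is not the paper's implementation: the `Candidate` type and function names are hypothetical, the scores are invented, and collapsing step scores with `min` followed by score-weighted voting is one common convention assumed here for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (final answer, per-step scores in [0, 1]); hypothetical shape for illustration
Candidate = Tuple[str, List[float]]


def solution_score(step_scores: List[float]) -> float:
    """Collapse per-step scores into one score; using the minimum is an assumption."""
    return min(step_scores) if step_scores else 0.0


def rank_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Sort candidates from highest to lowest solution score."""
    return sorted(candidates, key=lambda c: solution_score(c[1]), reverse=True)


def weighted_vote(candidates: List[Candidate]) -> str:
    """Sum solution scores per distinct answer and return the top-scoring answer."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, step_scores in candidates:
        totals[answer] += solution_score(step_scores)
    return max(totals, key=totals.get)


# Made-up scores for three sampled solutions to the same problem.
candidates = [
    ("42", [0.9, 0.8, 0.7]),
    ("41", [0.95, 0.9, 0.85]),
    ("42", [0.6, 0.9, 0.5]),
]
```

In this toy data the two strategies disagree: ranking picks the single highest-scoring solution, while weighted voting favours the answer supported by more of the sampled solutions.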

arXiv information

Authors	Md Imbesat Hassan Rizvi, Xiaodan Zhu, Iryna Gurevych
Published	2025-06-18 14:37:59+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
