ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark

Abstract

Despite recent advances in AI, the development of systems capable of executing complex, multi-step reasoning tasks involving multiple tools remains a significant challenge. Current benchmarks fall short in capturing the real-world complexity of tool-use reasoning, where verifying the correctness of not only the final answer but also the intermediate steps is important for evaluation, development, and identifying failures during inference time. To bridge this gap, we introduce ToolComp, a comprehensive benchmark designed to evaluate multi-step tool-use reasoning. ToolComp is developed through a collaboration between models and human annotators, featuring human-edited/verified prompts, final answers, and process supervision labels, allowing for the evaluation of both final outcomes and intermediate reasoning. Evaluation across six different model families demonstrates the challenging nature of our dataset, with the majority of models achieving less than 50% accuracy. Additionally, we generate synthetic training data to compare the performance of outcome-supervised reward models (ORMs) with process-supervised reward models (PRMs) to assess their ability to improve complex tool-use reasoning as evaluated by ToolComp. Our results show that PRMs generalize significantly better than ORMs, achieving a 19% and 11% improvement in rank@1 accuracy for ranking base and fine-tuned model trajectories, respectively. These findings highlight the critical role of process supervision in both the evaluation and training of AI models, paving the way for more robust and capable systems in complex, multi-step tool-use tasks.
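
The rank@1 comparison mentioned above can be made concrete with a small sketch. It assumes that rank@1 accuracy means the fraction of prompts for which the reward model's top-scored candidate trajectory ends in a correct final answer; the abstract does not spell out this definition. The names `prm_trajectory_score` and `rank_at_1_accuracy`, the use of the minimum step score to aggregate PRM outputs, and the toy data are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Each candidate is (reward_model_score, final_answer_is_correct).
Candidate = Tuple[float, bool]


def prm_trajectory_score(step_scores: List[float]) -> float:
    """Collapse per-step PRM scores into a single trajectory score.

    Taking the minimum is an assumption made for this sketch; the abstract
    does not say how step-level scores are aggregated.
    """
    return min(step_scores)


def rank_at_1_accuracy(candidates_by_prompt: Dict[str, List[Candidate]]) -> float:
    """Fraction of prompts whose top-scored candidate has a correct final answer."""
    hits = 0
    for candidates in candidates_by_prompt.values():
        # Pick the candidate the reward model scores highest (first one wins ties).
        _, best_is_correct = max(candidates, key=lambda c: c[0])
        hits += int(best_is_correct)
    return hits / len(candidates_by_prompt)


# Toy data: two prompts with two candidate trajectories each.
toy = {
    "prompt_a": [
        (prm_trajectory_score([0.9, 0.8]), True),
        (prm_trajectory_score([0.9, 0.2]), False),
    ],
    "prompt_b": [(0.4, True), (0.7, False)],
}
print(rank_at_1_accuracy(toy))
```

An outcome-supervised model would supply one score per trajectory directly, so the same `rank_at_1_accuracy` function could be applied to both kinds of reward model for a like-for-like comparison.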

arXiv information

Authors	Vaskar Nath, Pranav Raja, Claire Yoon, Sean Hendryx
Published	2025-01-02 15:10:52+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
