Retrieval Models Aren’t Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models

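Summary

Tool learning aims to augment large language models (LLMs) with diverse tools so that they can act as agents for solving practical tasks.
Because tool-using LLMs have a limited context length, using information retrieval (IR) models to select useful tools from a large toolset is a critical first step.
However, the performance of IR models on tool retrieval tasks remains underexplored and unclear.
Most tool-use benchmarks simplify this step by manually pre-annotating a small set of relevant tools for each task, which is far from real-world scenarios.
This paper proposes ToolRet, a heterogeneous tool retrieval benchmark comprising 7.6k diverse retrieval tasks and a corpus of 43k tools collected from existing datasets.
Six types of models are benchmarked on ToolRet.
Surprisingly, even models that perform strongly on conventional IR benchmarks perform poorly on ToolRet.
This low retrieval quality lowers the task pass rate of tool-use LLMs.
As a further step, the authors contribute a large-scale training dataset with over 200k instances, which substantially improves the tool retrieval ability of IR models.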

Abstract (original)

Tool learning aims to augment large language models (LLMs) with diverse tools, enabling them to act as agents for solving practical tasks. Due to the limited context length of tool-using LLMs, adopting information retrieval (IR) models to select useful tools from large toolsets is a critical initial step. However, the performance of IR models in tool retrieval tasks remains underexplored and unclear. Most tool-use benchmarks simplify this step by manually pre-annotating a small set of relevant tools for each task, which is far from the real-world scenarios. In this paper, we propose ToolRet, a heterogeneous tool retrieval benchmark comprising 7.6k diverse retrieval tasks, and a corpus of 43k tools, collected from existing datasets. We benchmark six types of models on ToolRet. Surprisingly, even the models with strong performance in conventional IR benchmarks, exhibit poor performance on ToolRet. This low retrieval quality degrades the task pass rate of tool-use LLMs. As a further step, we contribute a large-scale training dataset with over 200k instances, which substantially optimizes the tool retrieval ability of IR models.

arXiv information

Authors	Zhengliang Shi, Yuhan Wang, Lingyong Yan, Pengjie Ren, Shuaiqiang Wang, Dawei Yin, Zhaochun Ren
Published	2025-05-26 15:19:40+00:00

Source and services used

arxiv.jp, Google
