NesTools: A Dataset for Evaluating Nested Tool Learning Abilities of Large Language Models

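Summary

Combining large language models (LLMs) with tool learning has produced impressive results in real-world applications.
During tool learning, an LLM may call multiple tools in a nested order, where a later tool call takes an earlier call's response as its input parameters.
However, nested tool learning capabilities remain under-explored, because existing benchmarks lack relevant data instances.
To address this problem, we introduce NesTools to fill the current gap in comprehensive evaluation of nested tool learning.
NesTools includes a new automatic data generation method for constructing large-scale nested tool calls with various nesting structures.
Manual review and refinement give the dataset high quality and keep it closely aligned with real-world scenarios.
NesTools can therefore serve as a new benchmark for evaluating the nested tool learning abilities of LLMs.
We conducted extensive experiments on 22 LLMs and provide in-depth analyses with NesTools, which show that current LLMs still struggle with complex nested tool learning tasks.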

Summary (original)

Large language models (LLMs) combined with tool learning have gained impressive results in real-world applications. During tool learning, LLMs may call multiple tools in nested orders, where the latter tool call may take the former response as its input parameters. However, current research on the nested tool learning capabilities is still under-explored, since the existing benchmarks lack relevant data instances. To address this problem, we introduce NesTools to bridge the current gap in comprehensive nested tool learning evaluations. NesTools comprises a novel automatic data generation method to construct large-scale nested tool calls with different nesting structures. With manual review and refinement, the dataset is of high quality and closely aligned with real-world scenarios. Therefore, NesTools can serve as a new benchmark to evaluate the nested tool learning abilities of LLMs. We conduct extensive experiments on 22 LLMs, and provide in-depth analyses with NesTools, which show that current LLMs still suffer from the complex nested tool learning task.
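To make the idea of a nested call concrete (a later call taking an earlier call's response as an input parameter), here is a minimal Python sketch. It assumes a hypothetical call format in which an argument written as "$<call_id>" stands for the response of an earlier call; the tools get_city_of_user and get_weather and their toy data are also invented for illustration. None of this is the NesTools data format or evaluation code.

```python
from typing import Any, Callable


# Hypothetical toy tools; they are NOT part of the NesTools tool set.
def get_city_of_user(user_id: str) -> str:
    return {"u1": "Paris"}.get(user_id, "unknown")


def get_weather(city: str) -> str:
    return {"Paris": "sunny"}.get(city, "no data")


TOOLS: dict[str, Callable[..., Any]] = {
    "get_city_of_user": get_city_of_user,
    "get_weather": get_weather,
}


def resolve(value: Any, results: dict[str, Any]) -> Any:
    """Replace a hypothetical "$<call_id>" placeholder with that earlier call's result."""
    if isinstance(value, str) and value.startswith("$"):
        ref = value[1:]
        if ref not in results:
            raise KeyError(f"call {ref!r} referenced before it was executed")
        return results[ref]
    return value


def run_nested_calls(calls: list[dict[str, Any]]) -> dict[str, Any]:
    """Execute calls in list order; a later call may consume an earlier call's response."""
    results: dict[str, Any] = {}
    for call in calls:
        args = {k: resolve(v, results) for k, v in call["arguments"].items()}
        results[call["id"]] = TOOLS[call["name"]](**args)
    return results


# The second call's "city" argument is filled from the first call's response.
calls = [
    {"id": "c0", "name": "get_city_of_user", "arguments": {"user_id": "u1"}},
    {"id": "c1", "name": "get_weather", "arguments": {"city": "$c0"}},
]
print(run_nested_calls(calls))  # {'c0': 'Paris', 'c1': 'sunny'}
```

Because the calls run in list order, referencing a call that has not run yet fails with a KeyError; this ordering dependency is what distinguishes nested calls from independent, parallel ones.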

arXiv information

Authors	Han Han, Tong Zhu, Xiang Zhang, Mengsong Wu, Hao Xiong, Wenliang Chen
Published	2024-10-15 17:33:43+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
