NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls

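Abstract

The resurgence of autonomous agents built on large language models (LLMs) to solve complex real-world tasks has drawn increasing attention to a fundamental LLM capability: tool or function calling.
The LLM at the core of such an agent must plan, execute, and respond using external tools, APIs, and custom functions.
Research on tool calling has gathered momentum, but evaluation benchmarks and datasets that reflect the complexity of these tasks have lagged behind.
This work focuses on one such complexity, nested sequencing, with the goal of extending existing benchmarks and evaluation.
Specifically, it presents NESTFUL, a benchmark that evaluates LLMs on nested sequences of API calls, i.e., sequences in which the output of one API call is passed as input to a subsequent call.
NESTFUL contains more than 1,800 nested sequences in which every function call is executable.
Experiments across multiple models and settings show that the best-performing model on the dataset reaches a full sequence match accuracy of 25% and a win rate of 34%, which indicates substantial room for improvement in the nested-sequencing aspect of function calling.
The analysis of these results suggests possible future research directions for the community, in addition to providing a benchmark for tracking progress.
The NESTFUL dataset is released under the Apache 2.0 license at https://github.com/IBM/NESTFUL.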

Abstract (original)

The resurgence of autonomous agents built using large language models (LLMs) to solve complex real-world tasks has brought increased focus on LLMs’ fundamental ability of tool or function calling. At the core of these agents, an LLM must plan, execute, and respond using external tools, APIs, and custom functions. Research on tool calling has gathered momentum, but evaluation benchmarks and datasets representing the complexity of the tasks have lagged behind. In this work, we focus on one such complexity, nested sequencing, with the goal of extending existing benchmarks and evaluation. Specifically, we present NESTFUL, a benchmark to evaluate LLMs on nested sequences of API calls, i.e., sequences where the output of one API call is passed as input to a subsequent call. NESTFUL contains 1800+ nested sequences where all the function calls are executable. Experimental results on multiple models and settings show that the best-performing model on the dataset has a full sequence match accuracy of 25% and win-rate of 34% necessitating a large scope for improvement in the nested sequencing aspect of function calling. Our analysis of these results provides possible future research directions for the community, in addition to a benchmark to track progress. We have released the NESTFUL dataset under the Apache 2.0 license at https://github.com/IBM/NESTFUL.
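To make the notion of a nested sequence concrete, the following is a minimal sketch of executing two chained calls, where the second call consumes the output of the first. The tool registry (`TOOLS`), the `label` field, and the `$label` reference syntax are hypothetical and chosen only for illustration; they are not taken from the NESTFUL dataset format.

```python
from typing import Any, Callable, Dict, List

# Hypothetical tool registry; these functions are illustrative only.
TOOLS: Dict[str, Callable[..., Any]] = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
}


def resolve(value: Any, outputs: Dict[str, Any]) -> Any:
    """Replace a reference such as "$step1" with the stored output of that step."""
    if isinstance(value, str) and value.startswith("$"):
        return outputs[value[1:]]  # KeyError if the step has not run yet
    return value


def run_sequence(calls: List[Dict[str, Any]]) -> Any:
    """Execute calls in order, feeding earlier outputs into later arguments."""
    outputs: Dict[str, Any] = {}
    result: Any = None
    for call in calls:
        args = {k: resolve(v, outputs) for k, v in call["arguments"].items()}
        result = TOOLS[call["name"]](**args)
        outputs[call["label"]] = result
    return result


# The second call takes the output of the first call as its argument "a".
sequence = [
    {"name": "add", "label": "step1", "arguments": {"a": 2, "b": 3}},
    {"name": "multiply", "label": "step2", "arguments": {"a": "$step1", "b": 4}},
]

final = run_sequence(sequence)
print(final)  # 20
```

In this sketch a model would be asked to produce `sequence` itself; a single wrong reference (for example, pointing at a step that has not run yet) makes the whole chain fail, which is one reason nested sequencing is harder than issuing independent calls.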

arXiv information

Authors	Kinjal Basu, Ibrahim Abdelaziz, Kiran Kate, Mayank Agarwal, Maxwell Crouse, Yara Rizk, Kelsey Bradford, Asim Munawar, Sadhana Kumaravel, Saurabh Goyal, Xin Wang, Luis A. Lastras, Pavan Kapanipathi
Published	2025-01-23 18:44:28+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
