ACEBench: Who Wins the Match Point in Tool Usage?

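Abstract

Large language models (LLMs) show significant potential in decision-making and reasoning, particularly when integrated with various tools to solve complex problems effectively.
However, existing benchmarks for evaluating LLMs' tool usage face several limitations: (1) limited evaluation scenarios, which often lack assessment in real multi-turn dialogue contexts;
(2) narrow evaluation dimensions, with insufficiently detailed assessment of how LLMs use tools; and
(3) reliance on LLMs or real API execution for evaluation, which introduces significant overhead.
To address these challenges, we introduce ACEBench, a comprehensive benchmark for evaluating tool usage in LLMs.
ACEBench categorizes its data into three primary types based on evaluation methodology: Normal, Special, and Agent.
"Normal" evaluates tool usage in basic scenarios.
"Special" evaluates tool usage in situations with ambiguous or incomplete instructions.
"Agent" evaluates tool usage through multi-agent interactions that simulate real-world multi-turn dialogues.
We conducted extensive experiments with ACEBench, analyzing a range of LLMs in depth and providing a more granular examination of error causes across the different data types.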

Abstract (original)

Large Language Models (LLMs) have demonstrated significant potential in decision-making and reasoning, particularly when integrated with various tools to effectively solve complex problems. However, existing benchmarks for evaluating LLMs’ tool usage face several limitations: (1) limited evaluation scenarios, often lacking assessments in real multi-turn dialogue contexts; (2) narrow evaluation dimensions, with insufficient detailed assessments of how LLMs use tools; and (3) reliance on LLMs or real API executions for evaluation, which introduces significant overhead. To address these challenges, we introduce ACEBench, a comprehensive benchmark for assessing tool usage in LLMs. ACEBench categorizes data into three primary types based on evaluation methodology: Normal, Special, and Agent. ‘Normal’ evaluates tool usage in basic scenarios; ‘Special’ evaluates tool usage in situations with ambiguous or incomplete instructions; ‘Agent’ evaluates tool usage through multi-agent interactions to simulate real-world, multi-turn dialogues. We conducted extensive experiments using ACEBench, analyzing various LLMs in-depth and providing a more granular examination of error causes across different data types.

arXiv information

Authors	Chen Chen, Xinlong Hao, Weiwen Liu, Xu Huang, Xingshan Zeng, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Yuefeng Huang, Wulong Liu, Xinzhi Wang, Defu Lian, Baoqun Yin, Yasheng Wang, Wu Liu
Published	2025-02-13 12:43:59+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
