ToolQA: A Dataset for LLM Question Answering with External Tools

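Summary

Large language models (LLMs) perform well across a wide range of NLP tasks, but they still suffer from problems such as hallucination and weak numerical reasoning.
External tools can be used to strengthen LLMs' question-answering abilities and overcome these problems.
However, current evaluation methods do not distinguish questions that can be answered from an LLM's internal knowledge from questions that require external information obtained through tool use.
To address this, we introduce ToolQA, a new dataset designed to faithfully evaluate LLMs' ability to use external tools for question answering.
Developing ToolQA involved a scalable, automated process for dataset curation, together with 13 specialized tools designed to interact with external knowledge in order to answer questions.
Importantly, we strive to minimize the overlap between the benchmark data and LLMs' pre-training data, which allows a more precise evaluation of LLMs' tool-use reasoning abilities.
We conducted an in-depth diagnosis of existing tool-use LLMs to highlight their strengths, weaknesses, and potential improvements.
Our findings set a new benchmark for evaluating LLMs and suggest new directions for future work.
Our data and code are freely available to the broader scientific community on GitHub.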

Summary (original)

Large Language Models (LLMs) have demonstrated impressive performance in various NLP tasks, but they still suffer from challenges such as hallucination and weak numerical reasoning. To overcome these challenges, external tools can be used to enhance LLMs’ question-answering abilities. However, current evaluation methods do not distinguish between questions that can be answered using LLMs’ internal knowledge and those that require external information through tool use. To address this issue, we introduce a new dataset called ToolQA, which is designed to faithfully evaluate LLMs’ ability to use external tools for question answering. Our development of ToolQA involved a scalable, automated process for dataset curation, along with 13 specialized tools designed for interaction with external knowledge in order to answer questions. Importantly, we strive to minimize the overlap between our benchmark data and LLMs’ pre-training data, enabling a more precise evaluation of LLMs’ tool-use reasoning abilities. We conducted an in-depth diagnosis of existing tool-use LLMs to highlight their strengths, weaknesses, and potential improvements. Our findings set a new benchmark for evaluating LLMs and suggest new directions for future advancements. Our data and code are freely available to the broader scientific community on GitHub.

arXiv information

Authors	Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, Chao Zhang
Published	2023-06-23 05:43:28+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
