FunctionChat-Bench: Comprehensive Evaluation of Language Models’ Generative Capabilities in Korean Tool-use Dialogs

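Abstract

This study investigates the generative capabilities of language models in tool-use dialogs.
We categorize model outputs in tool-use dialogs into four distinct types, each serving as an evaluation aspect: Tool Call, Answer Completion, Slot Question, and Relevance Detection.
We introduce FunctionChat-Bench, which comprises 700 evaluation items and automated assessment programs.
Using this benchmark, we evaluate several language models that support function calling.
Our findings indicate that while language models may achieve high accuracy in single-turn Tool Call scenarios, this does not necessarily translate into superior generative performance in multi-turn environments.
We argue that the capabilities required for function calling extend beyond generating tool call messages.
Models must also effectively generate conversational messages that engage the user.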
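
To make the four evaluation aspects concrete, the following is a minimal sketch of how results might be tallied per output type. The abstract does not describe the benchmark's data format or scoring procedure, so everything here is an assumption for illustration: the names `OutputType`, `JudgedItem`, and `pass_rate_by_type` are hypothetical, and the sample items are toy data, not results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class OutputType(Enum):
    """The four output types named in the abstract."""
    TOOL_CALL = "Tool Call"
    ANSWER_COMPLETION = "Answer Completion"
    SLOT_QUESTION = "Slot Question"
    RELEVANCE_DETECTION = "Relevance Detection"


@dataclass
class JudgedItem:
    """Hypothetical record: one evaluation item after it has been judged."""
    expected_type: OutputType  # which kind of output the item calls for
    passed: bool               # whether the model's output was judged acceptable


def pass_rate_by_type(items):
    """Return the fraction of passed items for each output type present."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for item in items:
        totals[item.expected_type] += 1
        passes[item.expected_type] += int(item.passed)
    return {t: passes[t] / totals[t] for t in totals}


if __name__ == "__main__":
    # Toy data only; these are not benchmark results.
    toy_items = [
        JudgedItem(OutputType.TOOL_CALL, True),
        JudgedItem(OutputType.TOOL_CALL, False),
        JudgedItem(OutputType.SLOT_QUESTION, True),
    ]
    for output_type, rate in pass_rate_by_type(toy_items).items():
        print(f"{output_type.value}: {rate:.2f}")
```

Reporting one rate per output type, rather than a single aggregate score, mirrors the point made in the abstract: strong Tool Call accuracy alone does not show how well a model handles the conversational output types.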

Abstract (original)

This study investigates language models’ generative capabilities in tool-use dialogs. We categorize the models’ outputs in tool-use dialogs into four distinct types: Tool Call, Answer Completion, Slot Question, and Relevance Detection, which serve as aspects for evaluation. We introduce FunctionChat-Bench, comprising 700 evaluation items and automated assessment programs. Using this benchmark, we evaluate several language models that support function calling. Our findings indicate that while language models may exhibit high accuracy in single-turn Tool Call scenarios, this does not necessarily translate to superior generative performance in multi-turn environments. We argue that the capabilities required for function calling extend beyond generating tool call messages; they must also effectively generate conversational messages that engage the user.

arXiv information

Authors	Shinbok Lee, Gaeun Seo, Daniel Lee, Byeongil Ko, Sunghee Jung, Myeongcheol Shin
Published	2024-11-21 11:59:13+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
