MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues

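Abstract

The advent of large language models (LLMs) has greatly enhanced dialogue systems.
However, comprehensively evaluating the dialogue abilities of LLMs remains a challenge.
Previous benchmarks either focused mainly on single-turn dialogues or offered coarse-grained, incomplete assessments of multi-turn dialogues, overlooking the complexity and fine-grained nuances of real-life dialogues.
To address this issue, we introduce MT-Bench-101, which is specifically designed to evaluate the fine-grained abilities of LLMs in multi-turn dialogues.
Through a detailed analysis of real multi-turn dialogue data, we construct a three-tier hierarchical ability taxonomy comprising 4208 turns across 1388 multi-turn dialogues in 13 distinct tasks.
We then evaluate 21 popular LLMs on MT-Bench-101, conduct comprehensive analyses from both the ability and the task perspective, and observe differing trends in LLM performance across dialogue turns within various tasks.
Further analysis indicates that neither common alignment techniques nor chat-specific designs have led to obvious enhancements in the multi-turn abilities of LLMs.
Extensive case studies suggest that our designed tasks accurately assess the corresponding multi-turn abilities.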

Abstract (original)

The advent of Large Language Models (LLMs) has drastically enhanced dialogue systems. However, comprehensively evaluating the dialogue abilities of LLMs remains a challenge. Previous benchmarks have primarily focused on single-turn dialogues or provided coarse-grained and incomplete assessments of multi-turn dialogues, overlooking the complexity and fine-grained nuances of real-life dialogues. To address this issue, we introduce MT-Bench-101, specifically designed to evaluate the fine-grained abilities of LLMs in multi-turn dialogues. By conducting a detailed analysis of real multi-turn dialogue data, we construct a three-tier hierarchical ability taxonomy comprising 4208 turns across 1388 multi-turn dialogues in 13 distinct tasks. We then evaluate 21 popular LLMs based on MT-Bench-101, conducting comprehensive analyses from both ability and task perspectives and observing differing trends in LLM performance across dialogue turns within various tasks. Further analysis indicates that neither utilizing common alignment techniques nor chat-specific designs has led to obvious enhancements in the multi-turn abilities of LLMs. Extensive case studies suggest that our designed tasks accurately assess the corresponding multi-turn abilities.

arXiv information

Authors	Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, Wanli Ouyang
Published	2024-02-22 18:21:59+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
