A User-Centric Benchmark for Evaluating Large Language Models

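Abstract

Large language models (LLMs) are essential tools for collaborating with users on a variety of tasks.
It is therefore important to evaluate how well they serve users' needs in real-world scenarios.
Many benchmarks have been created, but they mainly focus on specific, predefined model abilities.
Few of them cover how real users actually intend to use LLMs.
To address this oversight, we propose benchmarking LLMs from the user's perspective in both dataset construction and evaluation design.
First, we collected 1846 real-world use cases involving 15 LLMs from a user study with 712 participants from 23 countries.
These self-reported cases form the User Reported Scenarios (URS) dataset, which is categorized into 7 user intents.
Second, using this authentic, multicultural dataset, we benchmark 10 LLM services on how effectively they satisfy user needs.
Third, we show that the benchmark scores align well with user-reported experience in LLM interactions across diverse intents; both indicate that subjective scenarios have been overlooked.
In conclusion, our study proposes benchmarking LLMs from a user-centric perspective, with the aim of facilitating evaluations that better reflect real user needs.
The benchmark dataset and code are available at https://github.com/Alice1998/URS.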

Abstract (original)

Large Language Models (LLMs) are essential tools to collaborate with users on different tasks. Evaluating their performance to serve users’ needs in real-world scenarios is important. While many benchmarks have been created, they mainly focus on specific predefined model abilities. Few have covered the intended utilization of LLMs by real users. To address this oversight, we propose benchmarking LLMs from a user perspective in both dataset construction and evaluation designs. We first collect 1846 real-world use cases with 15 LLMs from a user study with 712 participants from 23 countries. These self-reported cases form the User Reported Scenarios(URS) dataset with a categorization of 7 user intents. Secondly, on this authentic multi-cultural dataset, we benchmark 10 LLM services on their efficacy in satisfying user needs. Thirdly, we show that our benchmark scores align well with user-reported experience in LLM interactions across diverse intents, both of which emphasize the overlook of subjective scenarios. In conclusion, our study proposes to benchmark LLMs from a user-centric perspective, aiming to facilitate evaluations that better reflect real user needs. The benchmark dataset and code are available at https://github.com/Alice1998/URS.

arXiv information

Authors	Jiayin Wang, Fengran Mo, Weizhi Ma, Peijie Sun, Min Zhang, Jian-Yun Nie
Published	2024-04-23 01:58:24+00:00

Source and services used

arxiv.jp, Google
