ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models

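Summary

Emotion Support Conversation (ESC) is an important application that aims to reduce human stress, offer emotional guidance, and ultimately improve mental and physical well-being.
With the advancement of Large Language Models (LLMs), many researchers have adopted LLMs as ESC models.
However, how to evaluate these LLM-based ESC models remains an open question.
Inspired by recent progress in role-playing agents, the authors propose an ESC evaluation framework (ESC-Eval), in which a role-playing agent interacts with ESC models and the resulting dialogues are then evaluated manually.
Specifically, they first reorganize 2,801 role-playing cards from seven existing datasets to define the roles played by the role-playing agent.
Second, they train a dedicated role-playing model called ESC-Role, which behaves more like a confused person than GPT-4 does.
Third, using ESC-Role and the organized role cards, they systematically run experiments with 14 LLMs as ESC models, including general AI-assistant LLMs (ChatGPT) and ESC-oriented LLMs (ExTES-Llama).
They then carry out comprehensive human annotation of the interactive multi-turn dialogues produced by the different ESC models.
The results show that ESC-oriented LLMs exhibit better ESC abilities than general AI-assistant LLMs, but still fall short of human performance.
In addition, to automate scoring for future ESC models, they develop ESC-RANK, which is trained on the annotated data and achieves a scoring performance that exceeds GPT-4's by 35 points.
The data and code are available at https://github.com/AIFlames/Esc-Eval.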
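The interaction step described above (a role-playing agent, conditioned on a role card, talking to an ESC model over several turns) can be pictured with a minimal sketch. Everything here is an assumption for illustration: the `RoleCard` fields, the prompt wording, the `simulate_dialogue` helper, and the fixed turn count are hypothetical and are not taken from the ESC-Eval code base. The two chat functions stand in for ESC-Role and the ESC model under test.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# One dialogue turn: (speaker, text).
Turn = Tuple[str, str]
# A chat function maps a system prompt plus the dialogue so far to the next utterance.
ChatFn = Callable[[str, List[Turn]], str]


@dataclass
class RoleCard:
    """Hypothetical role card: who the help seeker is and what troubles them."""
    persona: str
    situation: str


def simulate_dialogue(card: RoleCard, seeker: ChatFn, supporter: ChatFn,
                      max_turns: int = 5) -> List[Turn]:
    """Alternate seeker and supporter utterances for max_turns rounds."""
    # Assumed prompt wording, for illustration only.
    seeker_prompt = f"You are {card.persona}. You are troubled because {card.situation}."
    supporter_prompt = "You are an emotional support assistant."
    history: List[Turn] = []
    for _ in range(max_turns):
        # The role-playing agent speaks first, then the ESC model replies.
        history.append(("seeker", seeker(seeker_prompt, history)))
        history.append(("supporter", supporter(supporter_prompt, history)))
    return history
```

In this sketch the returned list of turns is what human annotators (or a trained scorer) would rate; plugging in real models only requires wrapping each one as a `ChatFn`.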

Abstract (original)

Emotion Support Conversation (ESC) is a crucial application, which aims to reduce human stress, offer emotional guidance, and ultimately enhance human mental and physical well-being. With the advancement of Large Language Models (LLMs), many researchers have employed LLMs as the ESC models. However, the evaluation of these LLM-based ESCs remains uncertain. Inspired by the awesome development of role-playing agents, we propose an ESC Evaluation framework (ESC-Eval), which uses a role-playing agent to interact with ESC models, followed by a manual evaluation of the interactive dialogues. In detail, we first re-organize 2,801 role-playing cards from seven existing datasets to define the roles of the role-playing agent. Second, we train a specific role-playing model called ESC-Role which behaves more like a confused person than GPT-4. Third, through ESC-Role and organized role cards, we systematically conduct experiments using 14 LLMs as the ESC models, including general AI-assistant LLMs (ChatGPT) and ESC-oriented LLMs (ExTES-Llama). We conduct comprehensive human annotations on interactive multi-turn dialogues of different ESC models. The results show that ESC-oriented LLMs exhibit superior ESC abilities compared to general AI-assistant LLMs, but there is still a gap behind human performance. Moreover, to automate the scoring process for future ESC models, we developed ESC-RANK, which trained on the annotated data, achieving a scoring performance surpassing 35 points of GPT-4. Our data and code are available at https://github.com/AIFlames/Esc-Eval.

arXiv information

Authors	Haiquan Zhao, Lingyu Li, Shisong Chen, Shuqi Kong, Jiaan Wang, Kexin Huang, Tianle Gu, Yixu Wang, Wang Jian, Dandan Liang, Zhixu Li, Yan Teng, Yanghua Xiao, Yingchun Wang
Published	2024-10-28 13:25:49+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
