Exploring the Dialogue Comprehension Ability of Large Language Models

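Abstract

LLMs may interact with users through dialogue and generate responses that follow their instructions, which naturally requires dialogue comprehension ability.
However, dialogue comprehension is a general language ability that is hard to evaluate directly.
In this work, we propose to evaluate it with the help of the dialogue summarization task.
Besides evaluating and analyzing the dialogue summarization performance of different LLMs (DIAC-Sum), we derive factual questions from the generated summaries and use them as a more flexible measure of dialogue comprehension (DIAC-FactQA).
Our evaluation shows that, on average, 27% of the summaries generated by LLMs contain factual inconsistencies.
Even ChatGPT, the strongest model evaluated, has such errors in 16% of its summaries.
On the more challenging task of answering the factual questions, the average error rate across all evaluated LLMs is 37.2%.
Both results indicate serious deficiencies.
Detailed analysis shows that understanding the subject and object of the conversation is still the most challenging problem for LLMs.
Furthermore, to stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with automatically constructed multi-task data.
The experimental results show that our method achieves an error rate improvement of 10.9% on DIAC-FactQA.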
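The error rates quoted above are proportions of items judged incorrect, averaged over the evaluated models. The sketch below illustrates that arithmetic only. It is not the authors' evaluation code: the model names and judgment labels are hypothetical, and treating the reported average as an unweighted mean over models is an assumption the abstract does not confirm.

```python
from statistics import mean


def error_rate(judgments):
    """Return the fraction of items judged erroneous.

    `judgments` is a list of booleans, True meaning the summary
    (or QA answer) was judged factually wrong.
    """
    if not judgments:
        raise ValueError("no judgments to score")
    return sum(judgments) / len(judgments)


def average_error_rate(rates_by_model):
    """Unweighted mean of per-model error rates (an assumed convention)."""
    return mean(rates_by_model.values())


# Hypothetical judgments for two made-up models, for illustration only.
judgments_by_model = {
    "model_a": [True, False, False, False],
    "model_b": [True, True, False, False],
}

rates = {name: error_rate(j) for name, j in judgments_by_model.items()}
average = average_error_rate(rates)
```

An unweighted mean gives every model equal influence regardless of how many items it was scored on. Pooling all judgments before dividing would weight models by item count instead.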

Abstract (original)

LLMs may interact with users in the form of dialogue and generate responses following their instructions, which naturally require dialogue comprehension abilities. However, dialogue comprehension is a general language ability which is hard to be evaluated directly. In this work, we propose to perform the evaluation with the help of the dialogue summarization task. Beside evaluating and analyzing the dialogue summarization performance (DIAC-Sum) of different LLMs, we also derive factual questions from the generated summaries and use them as a more flexible measurement of dialogue comprehension (DIAC-FactQA). Our evaluation shows that, on average, 27% of the summaries generated by LLMs contain factual inconsistency. Even ChatGPT, the strongest model evaluated, has such errors in 16% of its summaries. For answering the factual questions, which is more challenging, the average error rate of all evaluated LLMs is 37.2%. Both results indicate serious deficiencies. Detailed analysis shows that the understanding of subject/object of the conversation is still the most challenging problem for LLMs. Furthermore, to stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data. The experimental results demonstrate that our method achieved an error rate improvement of 10.9% on DIAC-FactQA.

arXiv information

Authors	Shuaijie She, Shujian Huang, Xingyun Wang, Yanke Zhou, Jiajun Chen
Published	2023-11-16 11:56:12+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
