Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization

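Abstract

Large language models (LLMs) already achieve strong performance on standard generic summarization benchmarks, but their performance in more complex summarization task settings has received little study.
We therefore benchmark LLMs on instruction controllable text summarization, where the model input consists of both a source article and a natural language requirement for the desired summary characteristics.
To this end, we curate an evaluation-only dataset for this task setting and conduct human evaluation of 5 LLM-based summarization systems.
We then benchmark LLM-based automatic evaluation for this task using 4 different evaluation protocols and 11 LLMs, for a total of 40 evaluation methods.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs, because (1) all LLMs evaluated still make factual and other types of errors in their summaries;
(2) no LLM-based evaluation method achieves strong alignment with human annotators when judging the quality of candidate summaries; and
(3) different LLMs show large performance gaps in summary generation and evaluation.
To facilitate future research in this direction, we make our collected benchmark, InstruSum, publicly available.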

Abstract (original)

While large language models (LLMs) already achieve strong performance on standard generic summarization benchmarks, their performance on more complex summarization task settings is less studied. Therefore, we benchmark LLMs on instruction controllable text summarization, where the model input consists of both a source article and a natural language requirement for the desired summary characteristics. To this end, we curate an evaluation-only dataset for this task setting and conduct human evaluation on 5 LLM-based summarization systems. We then benchmark LLM-based automatic evaluation for this task with 4 different evaluation protocols and 11 LLMs, resulting in 40 evaluation methods in total. Our study reveals that instruction controllable text summarization remains a challenging task for LLMs, since (1) all LLMs evaluated still make factual and other types of errors in their summaries; (2) all LLM-based evaluation methods cannot achieve a strong alignment with human annotators when judging the quality of candidate summaries; (3) different LLMs show large performance gaps in summary generation and evaluation. We make our collected benchmark, InstruSum, publicly available to facilitate future research in this direction.

arXiv information

Authors	Yixin Liu, Alexander R. Fabbri, Jiawen Chen, Yilun Zhao, Simeng Han, Shafiq Joty, Pengfei Liu, Dragomir Radev, Chien-Sheng Wu, Arman Cohan
Published	2023-11-15 18:25:26+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
