USB: A Unified Summarization Benchmark Across Tasks and Domains

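Abstract

The NLP community has produced many summarization benchmarks, but none provide the rich annotations needed to simultaneously address many important problems related to control and reliability. We introduce a Wikipedia-derived benchmark, complemented by a rich set of crowd-sourced annotations, that supports $8$ interrelated tasks: (i) extractive summarization, (ii) abstractive summarization, (iii) topic-based summarization, (iv) compressing selected sentences into a one-line summary, (v) surfacing evidence for a summary sentence, (vi) predicting the factual accuracy of a summary sentence, (vii) identifying unsubstantiated spans in a summary sentence, and (viii) correcting factual errors in summaries. We compare various methods on this benchmark and find that, on multiple tasks, moderately sized fine-tuned models consistently outperform much larger few-shot prompted language models. For factuality-related tasks, we also evaluate existing heuristics for creating training data and find that training on them performs worse than training on $20\times$ less human-labeled data. Our articles are drawn from $6$ domains, which facilitates cross-domain analysis. On some tasks, the amount of training data matters more than the domain it comes from, while on other tasks, training specifically on data from the target domain, even if limited, is more beneficial.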

Abstract (original)

While the NLP community has produced numerous summarization benchmarks, none provide the rich annotations required to simultaneously address many important problems related to control and reliability. We introduce a Wikipedia-derived benchmark, complemented by a rich set of crowd-sourced annotations, that supports $8$ interrelated tasks: (i) extractive summarization; (ii) abstractive summarization; (iii) topic-based summarization; (iv) compressing selected sentences into a one-line summary; (v) surfacing evidence for a summary sentence; (vi) predicting the factual accuracy of a summary sentence; (vii) identifying unsubstantiated spans in a summary sentence; (viii) correcting factual errors in summaries. We compare various methods on this benchmark and discover that on multiple tasks, moderately-sized fine-tuned models consistently outperform much larger few-shot prompted language models. For factuality-related tasks, we also evaluate existing heuristics to create training data and find that training on them results in worse performance than training on $20\times$ less human-labeled data. Our articles draw from $6$ domains, facilitating cross-domain analysis. On some tasks, the amount of training data matters more than the domain where it comes from, while for other tasks training specifically on data from the target domain, even if limited, is more beneficial.

arXiv information

Authors	Kundan Krishna, Prakhar Gupta, Sanjana Ramprasad, Byron C. Wallace, Jeffrey P. Bigham, Zachary C. Lipton
Published	2023-12-04 15:53:50+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
