Large Language Models are Diverse Role-Players for Summarization Evaluation

Summary

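Text summarization has a wide range of applications across many scenarios.
Evaluating the quality of generated text is a complex problem.
A major challenge in language evaluation is the clear divergence between existing metrics and human evaluation.
For example, human annotators can measure the quality of a document summary from both objective aspects, such as grammatical and semantic correctness, and subjective aspects, such as comprehensiveness, succinctness, and interestingness.
Most automatic evaluation methods, such as BLEU and ROUGE, may not capture these dimensions well.
In this paper, we propose a new LLM-based evaluation framework that compares generated text with reference text from both objective and subjective aspects to provide a comprehensive evaluation.
First, we propose modeling the objective and subjective dimensions of generated text with a roleplayer prompting mechanism.
Furthermore, we introduce a context-based prompting mechanism that can generate dynamic roleplayer profiles from the input context.
Finally, we design a multi-roleplayer prompting technique based on batch prompting that integrates multiple evaluation results into a final evaluation result.
Experimental results on two real summarization datasets show that our model is highly competitive and highly consistent with human annotators.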

Summary (original)

Text summarization has a wide range of applications in many scenarios. The evaluation of the quality of the generated text is a complex problem. A big challenge to language evaluation is that there is a clear divergence between existing metrics and human evaluation. For example, the quality of a document summary can be measured by human annotators from both objective aspects, such as grammatical and semantic correctness, as well as subjective dimensions, such as comprehensiveness, succinctness, and interestingness. Most of the automatic evaluation methods like BLEU/ROUGE may not be able to capture the above dimensions well. In this paper, we propose a new evaluation framework based on LLMs, which provides a comprehensive evaluation framework by comparing generated text and reference text from both objective and subjective aspects. First, we propose to model objective and subjective dimensions of generated text based on roleplayers prompting mechanism. Furthermore, we introduce a context-based prompting mechanism that is able to generate dynamic roleplayer profiles based on input context. Finally, we design a multi-roleplayer prompting technology based on batch prompting to integrate multiple evaluation results into evaluation results. Experimental results on two real datasets for summarization show that our model is highly competitive and has a very high consistency with human annotators.
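
The multi-roleplayer batch prompting idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the role profiles, the prompt wording, the 1 to 5 rating scale, the 'index: score' reply format, and the mean aggregation are all assumptions made for illustration, and `llm` is a hypothetical callable standing in for any text-in, text-out model.

```python
import re
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical static role profiles. The paper describes generating such
# profiles dynamically from the input context; they are hard-coded here.
ROLES: List[str] = [
    "Grammar checker (objective): grammatical and semantic correctness",
    "General reader (subjective): comprehensiveness",
    "Busy editor (subjective): succinctness",
    "Curious critic (subjective): interestingness",
]


def build_batch_prompt(reference: str, generated: str, roles: List[str]) -> str:
    """Pack every role into one prompt so a single call returns all ratings."""
    lines = [
        f"Reference summary: {reference}",
        f"Generated summary: {generated}",
        "For each role below, rate the generated summary against the reference",
        "from 1 (much worse) to 5 (much better).",
        "Answer with one line per role in the form 'index: score'.",
    ]
    lines += [f"{i}. {role}" for i, role in enumerate(roles, start=1)]
    return "\n".join(lines)


def parse_scores(reply: str, n_roles: int) -> Dict[int, int]:
    """Extract 'index: score' lines; ignore anything malformed or out of range."""
    scores: Dict[int, int] = {}
    for line in reply.splitlines():
        match = re.match(r"\s*(\d+)\s*:\s*([1-5])\s*$", line)
        if match and 1 <= int(match.group(1)) <= n_roles:
            scores[int(match.group(1))] = int(match.group(2))
    return scores


def evaluate(
    reference: str,
    generated: str,
    llm: Callable[[str], str],
    roles: List[str] = ROLES,
) -> float:
    """Aggregate the per-role ratings into one score (assumed here: the mean)."""
    prompt = build_batch_prompt(reference, generated, roles)
    scores = parse_scores(llm(prompt), len(roles))
    if len(scores) != len(roles):
        raise ValueError("LLM reply is missing a score for at least one role")
    return float(mean(scores.values()))
```

With a stub such as `llm = lambda prompt: "1: 4\n2: 3\n3: 5\n4: 4"`, `evaluate("ref", "gen", llm)` returns 4.0. Batching all roles into one prompt keeps the cost at a single model call per summary, at the price of depending on the model following the reply format.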

arXiv information

Authors	Ning Wu, Ming Gong, Linjun Shou, Shining Liang, Daxin Jiang
Published	2023-03-27 10:40:59+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
