Large Language Models are Diverse Role-Players for Summarization Evaluation

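Summary

Text summarization is used in a wide range of applications.
Evaluating the quality of generated text is a hard problem.
A major challenge in language evaluation is the clear gap between existing metrics and human judgment.
Human annotators can assess the quality of a document summary against various criteria, both objective ones such as grammar and correctness and subjective ones such as informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU and ROUGE, may not adequately capture these dimensions.
In this paper, we propose a new LLM-based evaluation framework that evaluates generated text comprehensively by comparing it with reference text from both objective and subjective perspectives.
First, we propose modeling the objective and subjective dimensions of generated text with a roleplayer prompting mechanism.
Next, we introduce a context-based prompting mechanism that generates dynamic roleplayer profiles from the input context.
Finally, we design a multi-roleplayer prompting technique based on batch prompting and integrate the multiple outputs into a final evaluation result.
Experiments on three real summarization datasets show that our model is highly competitive and highly consistent with human annotators.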
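
The multi-roleplayer idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the role names and profiles, the prompt wording, the A/B answer format, and the aggregation by vote share are all hypothetical, and the LLM is passed in as a plain callable rather than a specific API. The roles here are static; generating them dynamically from the input context is not shown.

```python
from typing import Callable, Dict, Mapping

# Hypothetical static role profiles (the abstract does not list the real ones).
ROLES: Dict[str, str] = {
    "grammar checker": "judges grammar and fluency (objective)",
    "fact checker": "judges correctness (objective)",
    "busy reader": "judges succinctness and informativeness (subjective)",
    "casual reader": "judges how appealing the summary is (subjective)",
}


def build_batch_prompt(candidate: str, reference: str,
                       roles: Mapping[str, str]) -> str:
    """Pack every role-player into a single prompt (batch prompting)."""
    lines = [
        "Compare two summaries of the same document.",
        f"Summary A: {candidate}",
        f"Summary B: {reference}",
        "Answer once per role, one line each, as '<role>: A' or '<role>: B'.",
    ]
    for name, profile in roles.items():
        lines.append(f"- {name}: {profile}")
    return "\n".join(lines)


def parse_votes(reply: str, roles: Mapping[str, str]) -> Dict[str, str]:
    """Extract one 'A' or 'B' vote per known role; ignore malformed lines."""
    votes: Dict[str, str] = {}
    for line in reply.splitlines():
        name, sep, choice = line.partition(":")
        name = name.strip().lstrip("- ").strip()
        choice = choice.strip().upper()
        if sep and name in roles and choice in ("A", "B"):
            votes[name] = choice
    return votes


def evaluate(candidate: str, reference: str,
             llm: Callable[[str], str],
             roles: Mapping[str, str] = ROLES) -> float:
    """Return the share of roles that prefer the candidate (Summary A).

    Aggregating by vote share is an assumption of this sketch.
    """
    reply = llm(build_batch_prompt(candidate, reference, roles))
    votes = parse_votes(reply, roles)
    if not votes:
        raise ValueError("no parsable votes in the model reply")
    return sum(1 for v in votes.values() if v == "A") / len(votes)
```

To try it, pass any function that takes a prompt string and returns the model's text reply as `llm`; a stub that returns fixed lines such as `fact checker: A` is enough to exercise the parsing and aggregation.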

Abstract (original)

Text summarization has a wide range of applications in many scenarios. The evaluation of the quality of the generated text is a complex problem. A big challenge to language evaluation is that there is a clear divergence between existing metrics and human evaluation. A document summary’s quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal. Most of the automatic evaluation methods like BLEU/ROUGE may not be able to adequately capture the above dimensions. In this paper, we propose a new evaluation framework based on LLMs, which provides a comprehensive evaluation framework by comparing generated text and reference text from both objective and subjective aspects. First, we propose to model objective and subjective dimensions of generated text based on roleplayers prompting mechanism. Furthermore, we introduce a context-based prompting mechanism that is able to generate dynamic roleplayer profiles based on input context. Finally, we design a multi-roleplayer prompting technology based on batch prompting and integrate multiple outputs into the final evaluation results. Experimental results on three real datasets for summarization show that our model is highly competitive and has a very high consistency with human annotators.

arXiv information

Authors	Ning Wu, Ming Gong, Linjun Shou, Shining Liang, Daxin Jiang
Published	2023-09-19 10:07:55+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
