DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering

Abstract

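Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability.
Specifically, most high-performing metrics must be trained on evaluation datasets for specific NLG tasks and evaluation dimensions, which can lead to overfitting to task-specific datasets.
Furthermore, existing metrics only provide an evaluation score for each dimension, without revealing any evidence for interpreting how that score was obtained.
To address these challenges, we propose a simple yet effective metric called DecompEval.
This metric formulates NLG evaluation as an instruction-style question answering task and uses instruction-tuned pre-trained language models (PLMs) without training on evaluation datasets, with the aim of improving generalization ability.
To make the evaluation process more interpretable, we decompose our devised instruction-style question about the quality of the generated text into subquestions that measure the quality of each sentence.
The subquestions, together with the answers generated by PLMs, are then recomposed as evidence to obtain the evaluation result.
Experimental results show that DecompEval achieves state-of-the-art performance among untrained metrics for evaluating text summarization and dialogue generation, and that it also exhibits strong dimension-level and task-level generalization ability and interpretability.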
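
The decompose-then-recompose flow described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the prompt wording, the naive sentence splitter, the yes/no answer format, and every function name (`split_sentences`, `build_subquestions`, `recompose_prompt`, `decomp_eval_sketch`) are assumptions made for illustration, and the instruction-tuned PLM is abstracted as a caller-supplied `answer_fn` that maps a prompt string to an answer string.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical type for "ask the instruction-tuned PLM a question".
AnswerFn = Callable[[str], str]


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_subquestions(sentences: List[str], dimension: str) -> List[str]:
    """One subquestion per sentence; the wording is illustrative only."""
    return [
        f'Is this sentence {dimension}? Sentence: "{s}" Answer yes or no.'
        for s in sentences
    ]


def recompose_prompt(
    context: str, generated: str, evidence: List[Tuple[str, str]], dimension: str
) -> str:
    """Put the subquestions and their answers back in front of the overall question."""
    lines = [f"Context: {context}", f"Generated text: {generated}"]
    for question, answer in evidence:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
    lines.append(f"Q: Is the generated text {dimension} overall? Answer yes or no.")
    return "\n".join(lines)


def decomp_eval_sketch(
    context: str, generated: str, dimension: str, answer_fn: AnswerFn
) -> Tuple[str, List[Tuple[str, str]]]:
    """Return the overall answer plus the (subquestion, answer) pairs used as evidence."""
    subquestions = build_subquestions(split_sentences(generated), dimension)
    # Decompose: answer each sentence-level subquestion separately.
    evidence = [(q, answer_fn(f"Context: {context}\n{q}")) for q in subquestions]
    # Recompose: answer the overall question with the evidence in the prompt.
    final_answer = answer_fn(recompose_prompt(context, generated, evidence, dimension))
    return final_answer, evidence


if __name__ == "__main__":
    # Stand-in for a real model call; always answers "yes".
    def always_yes(prompt: str) -> str:
        return "yes"

    answer, evidence = decomp_eval_sketch(
        context="A: How was the trip? B: It rained all week.",
        generated="That sounds rough. Did you stay indoors?",
        dimension="coherent",
        answer_fn=always_yes,
    )
    print(answer)
    for question, reply in evidence:
        print(question, "->", reply)
```

In this sketch the returned evidence list is what makes the result inspectable: each sentence-level question and its answer can be read alongside the overall verdict. How the final answer is turned into a numeric score is left open here, since the abstract does not specify it.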

Abstract (original)

Existing evaluation metrics for natural language generation (NLG) tasks face the challenges on generalization ability and interpretability. Specifically, most of the well-performed metrics are required to train on evaluation datasets of specific NLG tasks and evaluation dimensions, which may cause over-fitting to task-specific datasets. Furthermore, existing metrics only provide an evaluation score for each dimension without revealing the evidence to interpret how this score is obtained. To deal with these challenges, we propose a simple yet effective metric called DecompEval. This metric formulates NLG evaluation as an instruction-style question answering task and utilizes instruction-tuned pre-trained language models (PLMs) without training on evaluation datasets, aiming to enhance the generalization ability. To make the evaluation process more interpretable, we decompose our devised instruction-style question about the quality of generated texts into the subquestions that measure the quality of each sentence. The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result. Experimental results show that DecompEval achieves state-of-the-art performance in untrained metrics for evaluating text summarization and dialogue generation, which also exhibits strong dimension-level / task-level generalization ability and interpretability.

arXiv information

Authors	Pei Ke, Fei Huang, Fei Mi, Yasheng Wang, Qun Liu, Xiaoyan Zhu, Minlie Huang
Published	2023-07-13 16:16:51+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
