Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References

Abstract

Most research on natural language generation (NLG) relies on evaluation benchmarks with only a limited number of references per sample, which may result in poor correlation with human judgements. The underlying reason is that a single meaning can be expressed in many different forms, so evaluation against one or a few references may not accurately reflect the quality of a model’s hypotheses. To address this issue, this paper presents a simple and effective method, named Div-Ref, that enhances existing evaluation benchmarks by enriching the number of references. We leverage large language models (LLMs) to diversify the expression of a single reference into multiple high-quality ones, covering the semantic space of the reference sentence as much as possible. We conduct comprehensive experiments to empirically demonstrate that diversifying the expression of references can significantly enhance the correlation between automatic evaluation and human evaluation. This idea is compatible with recent LLM-based evaluation, which can similarly benefit from incorporating multiple references. We strongly encourage future generation benchmarks to include more references, even if they are generated by LLMs, since this is a one-time effort. We release all the code and data at https://github.com/RUCAIBox/Div-Ref to facilitate research.
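
To make the idea of multi-reference evaluation concrete, here is a minimal sketch. It is not the authors' implementation. It assumes a toy unigram-overlap F1 as the metric and assumes that scores are aggregated by taking the maximum over references; the abstract specifies neither. The paraphrases are hand-written stand-ins for LLM-generated references, and the names `token_f1` and `multi_reference_score` are hypothetical.

```python
from collections import Counter


def token_f1(hypothesis: str, reference: str) -> float:
    """Unigram F1 between two lowercased, whitespace-tokenized strings (toy metric)."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(hypothesis: str, references: list[str]) -> float:
    """Score the hypothesis against every reference and keep the best match.

    Max aggregation is an assumption made for this sketch.
    """
    if not references:
        raise ValueError("at least one reference is required")
    return max(token_f1(hypothesis, ref) for ref in references)


# Hypothetical data: hand-written paraphrases standing in for LLM output.
original_reference = "the cat sat on the mat"
diversified_references = [
    original_reference,
    "a cat was sitting on the mat",
    "the cat rested on the rug",
]
hypothesis = "a cat was sitting on the rug"

single = multi_reference_score(hypothesis, [original_reference])
multi = multi_reference_score(hypothesis, diversified_references)
print(f"single-reference F1: {single:.3f}")
print(f"multi-reference F1:  {multi:.3f}")
```

Because the original reference stays in the diversified set, the max-aggregated score can never fall below the single-reference score. A valid hypothesis that is worded differently from the original reference is therefore penalized less.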

arXiv information

Authors	Tianyi Tang, Hongyuan Lu, Yuchen Eleanor Jiang, Haoyang Huang, Dongdong Zhang, Wayne Xin Zhao, Tom Kocmi, Furu Wei
Published	2024-04-03 15:52:28+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
