Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation


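Recent LLMs can generate highly readable abstractive summaries, to the point where traditional automated metrics for evaluating summary quality, such as ROUGE, have become saturated.
However, LLMs still sometimes introduce unwanted content into summaries, i.e., information that contradicts or is unsupported by the source.
In this work, we stress-test automatic factuality metrics.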


Modern LLMs can now produce highly readable abstractive summaries, to the point where traditional automated metrics for evaluating summary quality, such as ROUGE, have become saturated. However, LLMs still sometimes introduce unwanted content into summaries, i.e., information inconsistent with or unsupported by their source. Measuring the occurrence of these often subtle “hallucinations” automatically has proved to be challenging. This in turn has motivated development of a variety of metrics intended to measure the factual consistency of generated summaries against their source. But are these approaches measuring what they purport to do? In this work, we stress-test automatic factuality metrics. Specifically, we investigate whether and to what degree superficial attributes of summary texts suffice to predict “factuality”, finding that a (supervised) model using only such shallow features is reasonably competitive with SOTA factuality scoring methods. We then evaluate how factuality metrics respond to factual corrections in inconsistent summaries and find that only a few show meaningful improvements. In contrast, some metrics are more sensitive to benign, non-factual edits. Motivated by these insights, we show that one can “game” (most) automatic factuality metrics, i.e., reliably inflate “factuality” scores by appending innocuous sentences to generated summaries. Taken together, our results raise questions about the degree to which we should rely on existing automated factuality metrics and what exactly we want “factuality metrics” to measure.
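The two probes described in the abstract (predicting "factuality" from shallow features, and checking whether appending text inflates a score) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the feature set, the function names `shallow_features` and `score_shift`, and the `toy_overlap_metric` stand-in are hypothetical and are not the features, metrics, or appended sentences used in the paper.

```python
from typing import Callable, Dict


def shallow_features(source: str, summary: str) -> Dict[str, float]:
    """Surface attributes that ignore meaning entirely (illustrative choices only)."""
    src_tokens = source.lower().split()
    sum_tokens = summary.lower().split()
    src_vocab = set(src_tokens)
    # Count summary tokens that also occur somewhere in the source.
    copied = sum(1 for tok in sum_tokens if tok in src_vocab)
    return {
        "summary_length": float(len(sum_tokens)),
        "compression_ratio": len(sum_tokens) / max(len(src_tokens), 1),
        "token_overlap": copied / max(len(sum_tokens), 1),
    }


def toy_overlap_metric(source: str, summary: str) -> float:
    """Hypothetical stand-in for a factuality metric: share of summary tokens seen in the source."""
    return shallow_features(source, summary)["token_overlap"]


def score_shift(
    metric: Callable[[str, str], float],
    source: str,
    summary: str,
    appended: str,
) -> float:
    """Change in the metric's score after appending text to the summary."""
    return metric(source, summary + " " + appended) - metric(source, summary)


if __name__ == "__main__":
    source = "the cat sat on the mat"
    summary = "the dog sat on the mat"  # inconsistent with the source
    print(shallow_features(source, summary))
    # A positive shift means the score rose although the error was never corrected.
    print(score_shift(toy_overlap_metric, source, summary, "the cat sat"))
```

Any real metric with the signature `(source, summary) -> float` could be passed to `score_shift` in place of the toy one. A score that rises after an edit that leaves the error in place suggests the metric is responding to surface properties rather than factual consistency.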


Authors: Sanjana Ramprasad, Byron C. Wallace
Published: 2024-11-25 18:15:15+00:00
Source: arXiv


Categories: cs.AI, cs.CL