Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs

Summary

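In recent years, large language models (LLMs) have attracted considerable attention for emergent capabilities that surpass those of earlier language models.
One particularly interesting application of LLMs is as evaluators of text produced by various generative models.
This study examines whether LLMs can serve as reliable evaluators of factual consistency in summaries produced by text-generation models.
First, we introduce a new approach to factuality assessment using LLMs.
It uses a single LLM for the entire question-answering-based factuality scoring process.
Next, we examine how effective various LLMs are at direct factuality scoring, benchmarking them against traditional measures and human annotations.
Contrary to initial expectations, our results show no significant correlation between the factuality metrics and human evaluations, specifically for GPT-4 and PaLM-2.
Notable correlations were observed only with GPT-3.5, across two factuality subcategories.
These consistent findings across factual error categories suggest a fundamental limitation in the ability of current LLMs to assess factuality accurately.
This version presents the information more concisely while preserving the main points and findings of the original text.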
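
The comparison between metric scores and human annotations described above can be illustrated with a rank correlation. The following is a minimal sketch, not the authors' evaluation code: the summary does not say which correlation coefficient was used, so Spearman's rank correlation is an assumption here, and the function names and the sample scores are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(metric_scores, human_scores):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))


if __name__ == "__main__":
    # Hypothetical per-summary scores, for illustration only.
    llm_factuality = [0.9, 0.4, 0.7, 0.2, 0.6]
    human_factuality = [5, 2, 4, 1, 3]
    print(spearman(llm_factuality, human_factuality))
```

A coefficient near zero on real data would correspond to the "no significant correlation" finding; a significance test would be needed on top of this to make that claim.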

Summary (original)

In recent years, Large Language Models (LLMs) have gained immense attention due to their notable emergent capabilities, surpassing those seen in earlier language models. A particularly intriguing application of LLMs is their role as evaluators for texts produced by various generative models. In this study, we delve into the potential of LLMs as reliable assessors of factual consistency in summaries generated by text-generation models. Initially, we introduce an innovative approach for factuality assessment using LLMs. This entails employing a singular LLM for the entirety of the question-answering-based factuality scoring process. Following this, we examine the efficacy of various LLMs in direct factuality scoring, benchmarking them against traditional measures and human annotations. Contrary to initial expectations, our results indicate a lack of significant correlations between factuality metrics and human evaluations, specifically for GPT-4 and PaLM-2. Notable correlations were only observed with GPT-3.5 across two factuality subcategories. These consistent findings across various factual error categories suggest a fundamental limitation in the current LLMs’ capability to accurately gauge factuality. This version presents the information more concisely while maintaining the main points and findings of the original text.

arXiv information

Authors	Xue-Yong Fu, Md Tahmid Rahman Laskar, Cheng Chen, Shashi Bhushan TN
Published	2023-11-01 17:42:45+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
