UFO: a Unified and Flexible Framework for Evaluating Factuality of Large Language Models

Abstract

Large language models (LLMs) may generate text that is inconsistent with human knowledge, leading to factual inaccuracies or hallucination. Existing approaches to evaluating the factuality of LLMs extract fact claims using an LLM and verify them against a predefined fact source. However, these evaluation metrics are task-specific and not scalable, and the substitutability of fact sources across tasks is under-explored. To address these challenges, we categorize four available fact sources (human-written evidence, reference documents, search engine results, and LLM knowledge), along with five text generation tasks containing six representative datasets. We then propose UFO, an LLM-based unified and flexible evaluation framework that verifies facts against plug-and-play fact sources. We implement five evaluation scenarios based on this framework. Experimental results show that for most QA tasks, human-written evidence and reference documents are crucial, and they can substitute for each other in retrieval-augmented QA tasks. In news fact generation tasks, search engine results and LLM knowledge are essential. Our dataset and code are available at https://github.com/WaldenRUC/UFO.
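
The idea of plug-and-play fact sources can be illustrated with a minimal Python sketch. This is not the UFO codebase: the `FactSource` interface, the `Verdict` type, and the functions `make_static_source`, `verify_claims`, and `factuality_score` are all hypothetical names chosen for illustration. The abstract describes LLM-based claim extraction and verification; the sketch swaps in naive substring matching so that it runs without any model, and it assumes sources are tried in a fixed order with the first hit winning, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical interface (an assumption, not UFO's API): a fact source maps a
# claim to a supporting passage, or None when it has nothing relevant.
FactSource = Callable[[str], Optional[str]]


@dataclass
class Verdict:
    claim: str
    supported: bool
    source: Optional[str]  # name of the source that supplied the evidence


def make_static_source(passages: Sequence[str]) -> FactSource:
    """Toy source backed by fixed passages, standing in for human-written
    evidence or reference documents. Matching is a case-insensitive
    substring search, not an LLM judgment."""
    def lookup(claim: str) -> Optional[str]:
        for passage in passages:
            if claim.lower() in passage.lower():
                return passage
        return None
    return lookup


def verify_claims(
    claims: Sequence[str],
    sources: Sequence[Tuple[str, FactSource]],
) -> List[Verdict]:
    """Check each claim against the sources in order; the first hit wins."""
    verdicts = []
    for claim in claims:
        verdict = Verdict(claim, False, None)
        for name, source in sources:
            if source(claim) is not None:
                verdict = Verdict(claim, True, name)
                break
        verdicts.append(verdict)
    return verdicts


def factuality_score(verdicts: Sequence[Verdict]) -> float:
    """Fraction of claims supported by at least one source (0.0 if empty)."""
    if not verdicts:
        return 0.0
    return sum(v.supported for v in verdicts) / len(verdicts)


# Usage: two interchangeable sources plugged into the same verifier.
evidence = make_static_source(["Paris is the capital of France."])
reference = make_static_source(["The Seine flows through Paris."])
sources = [("human_written_evidence", evidence),
           ("reference_documents", reference)]
claims = ["paris is the capital of france",
          "the seine flows through paris",
          "paris is in spain"]
verdicts = verify_claims(claims, sources)
score = factuality_score(verdicts)
```

Because every source shares one call signature, swapping or reordering sources only changes the `sources` list. A search-engine or LLM-backed source would, under the same assumption, be another callable with that signature.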

arXiv information

Authors	Zhaoheng Huang, Zhicheng Dou, Yutao Zhu, Ji-rong Wen
Published	2024-02-22 16:45:32+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
