Generating Benchmarks for Factuality Evaluation of Language Models

Summary

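Before deploying a language model (LM) in a given domain, it is important to measure its tendency to generate factually incorrect information in that domain.
Existing methods for evaluating factual generation focus on facts sampled from the LM itself, so they do not control the set of facts being evaluated and may under-represent rare and unlikely facts.
We propose FACTOR: Factual Assessment via Corpus TransfORmation, a scalable approach for evaluating LM factuality.
FACTOR automatically transforms a factual corpus of interest into a benchmark that evaluates an LM's propensity to generate true facts from the corpus versus similar but incorrect statements.
We use our framework to create two benchmarks, Wiki-FACTOR and News-FACTOR, and show that:
(i) benchmark scores increase with model size and improve when the LM is augmented with retrieval;
(ii) benchmark score correlates with perplexity, but the two metrics do not always agree on model ranking; and
(iii) when perplexity and benchmark score disagree, the latter better reflects factuality in open-ended generation, as measured by human annotators.
Data and code are publicly available at https://github.com/AI21Labs/factor.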
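
To make the comparison between true facts and similar but incorrect statements concrete, here is a minimal sketch of how a FACTOR-style score could be computed. The summary does not spell out the scoring rule, so this assumes an example counts as correct when the model assigns the true completion a strictly higher length-normalized log-likelihood than every incorrect variant. The names `FactorExample`, `factor_accuracy`, and `toy_score` are hypothetical and are not taken from the AI21Labs/factor repository, and the log-probabilities in `TOY_TABLE` are invented for illustration rather than measured from any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical interface: returns the total log-probability that some model
# assigns to `completion` given `prefix`. Not part of any real library.
LogLikelihoodFn = Callable[[str, str], float]


@dataclass
class FactorExample:
    prefix: str                   # context taken from the corpus
    true_completion: str          # factual continuation from the corpus
    false_completions: List[str]  # similar but incorrect variants


def mean_log_likelihood(score: LogLikelihoodFn, prefix: str, completion: str) -> float:
    # Assumption: normalize by whitespace-separated word count. A real setup
    # would more likely normalize by the model's own token count.
    n_words = max(len(completion.split()), 1)
    return score(prefix, completion) / n_words


def is_correct(score: LogLikelihoodFn, example: FactorExample) -> bool:
    # Correct only if the true completion beats every incorrect variant.
    true_score = mean_log_likelihood(score, example.prefix, example.true_completion)
    return all(
        true_score > mean_log_likelihood(score, example.prefix, alt)
        for alt in example.false_completions
    )


def factor_accuracy(score: LogLikelihoodFn, examples: Sequence[FactorExample]) -> float:
    # Benchmark score: fraction of examples where the true completion wins.
    if not examples:
        raise ValueError("examples must be non-empty")
    return sum(is_correct(score, ex) for ex in examples) / len(examples)


# Invented log-probabilities standing in for a real LM (illustration only).
TOY_TABLE = {
    ("The capital of France is", "Paris"): -0.5,
    ("The capital of France is", "Lyon"): -2.0,
    ("The capital of France is", "Rome"): -3.0,
    ("Water boils at sea level at", "100 degrees Celsius"): -4.5,
    ("Water boils at sea level at", "90 degrees Celsius"): -3.0,
}


def toy_score(prefix: str, completion: str) -> float:
    return TOY_TABLE[(prefix, completion)]


examples = [
    FactorExample("The capital of France is", "Paris", ["Lyon", "Rome"]),
    FactorExample("Water boils at sea level at", "100 degrees Celsius", ["90 degrees Celsius"]),
]

print(factor_accuracy(toy_score, examples))
```

With these invented numbers the toy scorer prefers the true completion in the first example but not in the second, so it gets one of two examples right. Swapping `toy_score` for a function backed by a real LM would give a score that can be compared across models, alongside perplexity.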

Summary (original)

Before deploying a language model (LM) within a given domain, it is important to measure its tendency to generate factually incorrect information in that domain. Existing factual generation evaluation methods focus on facts sampled from the LM itself, and thus do not control the set of evaluated facts and might under-represent rare and unlikely facts. We propose FACTOR: Factual Assessment via Corpus TransfORmation, a scalable approach for evaluating LM factuality. FACTOR automatically transforms a factual corpus of interest into a benchmark evaluating an LM’s propensity to generate true facts from the corpus vs. similar but incorrect statements. We use our framework to create two benchmarks: Wiki-FACTOR and News-FACTOR. We show that: (i) our benchmark scores increase with model size and improve when the LM is augmented with retrieval; (ii) benchmark score correlates with perplexity, but the two metrics do not always agree on model ranking; and (iii) when perplexity and benchmark score disagree, the latter better reflects factuality in open-ended generation, as measured by human annotators. We make our data and code publicly available in https://github.com/AI21Labs/factor.

arXiv information

Authors	Dor Muhlgay, Ori Ram, Inbal Magar, Yoav Levine, Nir Ratner, Yonatan Belinkov, Omri Abend, Kevin Leyton-Brown, Amnon Shashua, Yoav Shoham
Published	2023-07-13 17:14:38+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
