TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models

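Summary

Factual consistency evaluation is often performed with Natural Language Inference (NLI) models, but these models have had only limited success at evaluating summaries.
Earlier work improved such models with synthetic training data.
However, that data is typically built from perturbed human-written summaries, which often differ in character from real model-generated summaries and cover only a limited range of possible factual errors.
Alternatively, large language models (LLMs) have recently shown promising results when evaluating generative tasks directly, but they are too computationally expensive for practical use.
Motivated by these limitations, the authors introduce TrueTeacher, a method that generates synthetic data by using an LLM to annotate diverse model-generated summaries.
Unlike prior work, TrueTeacher does not depend on human-written summaries and is multilingual by nature.
In experiments on the TRUE benchmark, a student model trained on this data substantially outperforms both the state-of-the-art model of similar capacity and the LLM teacher.
In a systematic study, the authors compare TrueTeacher with existing synthetic data generation methods and demonstrate its superiority and its robustness to domain shift.
They also show that the method generalizes to multilingual scenarios.
Finally, they release a large-scale synthetic dataset (1.4M examples) generated with TrueTeacher, along with a checkpoint trained on this data.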
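
The data-generation idea described in the summary can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `build_synthetic_dataset`, `LabeledExample`, and the toy summarizers and teacher are all hypothetical names introduced here, and the string-matching "teacher" merely stands in for an LLM that would be prompted to judge consistency.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class LabeledExample:
    document: str
    summary: str
    label: int  # 1 = factually consistent, 0 = inconsistent


def build_synthetic_dataset(
    documents: Sequence[str],
    summarizers: Sequence[Callable[[str], str]],
    teacher: Callable[[str, str], bool],
) -> List[LabeledExample]:
    """Label every (document, model summary) pair with the teacher's verdict."""
    examples = []
    for document in documents:
        for summarize in summarizers:
            summary = summarize(document)
            label = int(teacher(document, summary))
            examples.append(LabeledExample(document, summary, label))
    return examples


# Hypothetical stand-ins so the sketch runs without any real model.
def lead_sentence_summarizer(document: str) -> str:
    # Pretend "model": returns the first sentence of the document.
    return document.split(". ")[0].rstrip(".") + "."


def faulty_summarizer(document: str) -> str:
    # Pretend "model" that introduces a factual error.
    return lead_sentence_summarizer(document).replace(" sat ", " never sat ")


def toy_teacher(document: str, summary: str) -> bool:
    # Placeholder for an LLM judgment (assumption): a summary counts as
    # consistent only if the document contains it verbatim.
    return summary.rstrip(".") in document


dataset = build_synthetic_dataset(
    ["The cat sat on the mat. It was warm."],
    [lead_sentence_summarizer, faulty_summarizer],
    toy_teacher,
)
```

In this sketch the resulting `dataset` would then serve as training data for a smaller student classifier; the training step itself is omitted because the summary gives no details about it.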

Summary (original)

Factual consistency evaluation is often conducted using Natural Language Inference (NLI) models, yet these models exhibit limited success in evaluating summaries. Previous work improved such models with synthetic training data. However, the data is typically based on perturbed human-written summaries, which often differ in their characteristics from real model-generated summaries and have limited coverage of possible factual errors. Alternatively, large language models (LLMs) have recently shown promising results in directly evaluating generative tasks, but are too computationally expensive for practical use. Motivated by these limitations, we introduce TrueTeacher, a method for generating synthetic data by annotating diverse model-generated summaries using a LLM. Unlike prior work, TrueTeacher does not rely on human-written summaries, and is multilingual by nature. Experiments on the TRUE benchmark show that a student model trained using our data, substantially outperforms both the state-of-the-art model with similar capacity, and the LLM teacher. In a systematic study, we compare TrueTeacher to existing synthetic data generation methods and demonstrate its superiority and robustness to domain-shift. We also show that our method generalizes to multilingual scenarios. Lastly, we release our large scale synthetic dataset (1.4M examples), generated using TrueTeacher, and a checkpoint trained on this data.

arXiv information

Authors	Zorik Gekhman, Jonathan Herzig, Roee Aharoni, Chen Elkind, Idan Szpektor
Published	2023-10-17 14:45:39+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
