Benchmarking Large Language Models in Retrieval-Augmented Generation

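Summary

Retrieval-augmented generation (RAG) is a promising approach for mitigating hallucination in large language models (LLMs).
However, existing research lacks a rigorous evaluation of how retrieval-augmented generation affects different LLMs, which makes it hard to identify the potential bottlenecks in RAG capability for each model.
This paper systematically investigates the impact of retrieval-augmented generation on large language models.
We analyze the performance of different LLMs on four fundamental abilities required for RAG: noise robustness, negative rejection, information integration, and counterfactual robustness.
To this end, we establish the Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese.
RGB divides its instances into four separate testbeds according to which of these fundamental abilities is needed to resolve each case.
We then evaluate six representative LLMs on RGB to diagnose the challenges current LLMs face when applying RAG.
The evaluation shows that LLMs exhibit a certain degree of noise robustness but still struggle significantly with negative rejection, information integration, and handling false information.
These results indicate that there is still a considerable way to go before RAG can be applied to LLMs effectively.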

Summary (original)

Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language models, which makes it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG, including noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides the instances within the benchmark into 4 separate testbeds based on the aforementioned fundamental abilities required to resolve the case. Then we evaluate 6 representative LLMs on RGB to diagnose the challenges of current LLMs when applying RAG. Evaluation reveals that while LLMs exhibit a certain degree of noise robustness, they still struggle significantly in terms of negative rejection, information integration, and dealing with false information. The aforementioned assessment outcomes indicate that there is still a considerable journey ahead to effectively apply RAG to LLMs.
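
The abstract describes splitting benchmark instances into four testbeds, one per ability, and evaluating each model on them. A minimal sketch of that bookkeeping might look like the following. It is not the paper's evaluation code: the `Instance` fields, the testbed label strings, the `answer_fn` callback, and the substring-match scoring rule are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four abilities named in the abstract; the label strings are illustrative.
TESTBEDS = (
    "noise_robustness",
    "negative_rejection",
    "information_integration",
    "counterfactual_robustness",
)


@dataclass
class Instance:
    """One hypothetical benchmark record (field names are assumptions)."""
    testbed: str     # which of the four abilities this instance probes
    question: str
    documents: list  # retrieved passages shown to the model
    expected: str    # reference answer, or an expected rejection phrase


def score_by_testbed(instances, answer_fn):
    """Return {testbed: fraction of instances whose expected text appears in the output}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for inst in instances:
        if inst.testbed not in TESTBEDS:
            raise ValueError(f"unknown testbed: {inst.testbed}")
        output = answer_fn(inst.question, inst.documents)
        totals[inst.testbed] += 1
        # Assumed scoring rule: case-insensitive substring match.
        if inst.expected.lower() in output.lower():
            hits[inst.testbed] += 1
    return {name: hits[name] / totals[name] for name in totals}
```

Here `answer_fn(question, documents)` stands in for any function that calls an LLM with the retrieved documents and returns its text output. Reporting one score per testbed, rather than a single aggregate, is what lets a weakness in one ability (for example negative rejection) show up separately from noise robustness.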

arXiv information

Authors	Jiawei Chen, Hongyu Lin, Xianpei Han, Le Sun
Published	2023-12-20 11:54:11+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
