Should We Fine-Tune or RAG? Evaluating Different Techniques to Adapt LLMs for Dialogue

Abstract

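We study the limitations of large language models (LLMs) on the task of response generation in human-machine dialogue.
Several techniques have been proposed in the literature for different dialogue types (e.g., open-domain).
However, evaluations of these techniques have been limited in terms of base LLMs, dialogue types, and evaluation metrics.
In this work, we extensively analyze different LLM adaptation techniques applied to different dialogue types.
We selected two base LLMs, Llama-2 and Mistral, and four dialogue types: open-domain, knowledge-grounded, task-oriented, and question answering.
We evaluate the performance of in-context learning and fine-tuning techniques across the datasets selected for each dialogue type.
We assess the impact of incorporating external knowledge to ground the generation in both the retrieval-augmented generation (RAG) and gold-knowledge scenarios.
We adopt consistent evaluation and explainability criteria for automatic metrics and human evaluation protocols.
Our analysis shows that there is no universally best technique for adapting large language models, because the efficacy of each technique depends on both the base LLM and the specific dialogue type.
Last but not least, the assessment of the best adaptation technique should include human evaluation to avoid false expectations and outcomes derived from automatic metrics.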

Abstract (original)

We study the limitations of Large Language Models (LLMs) for the task of response generation in human-machine dialogue. Several techniques have been proposed in the literature for different dialogue types (e.g., Open-Domain). However, the evaluations of these techniques have been limited in terms of base LLMs, dialogue types and evaluation metrics. In this work, we extensively analyze different LLM adaptation techniques when applied to different dialogue types. We have selected two base LLMs, Llama-2 and Mistral, and four dialogue types: Open-Domain, Knowledge-Grounded, Task-Oriented, and Question Answering. We evaluate the performance of in-context learning and fine-tuning techniques across datasets selected for each dialogue type. We assess the impact of incorporating external knowledge to ground the generation in both scenarios of Retrieval-Augmented Generation (RAG) and gold knowledge. We adopt consistent evaluation and explainability criteria for automatic metrics and human evaluation protocols. Our analysis shows that there is no universal best technique for adapting large language models as the efficacy of each technique depends on both the base LLM and the specific type of dialogue. Last but not least, the assessment of the best adaptation technique should include human evaluation to avoid false expectations and outcomes derived from automatic metrics.

arXiv information

Authors	Simone Alghisi, Massimo Rizzoli, Gabriel Roccabruna, Seyed Mahed Mousavi, Giuseppe Riccardi
Published	2024-06-10 15:52:49+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
