Should We Fine-Tune or RAG? Evaluating Different Techniques to Adapt LLMs for Dialogue

Abstract

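We study the limitations of large language models (LLMs) for the task of response generation in human-machine dialogue. Several techniques have been proposed in the literature for different dialogue types (e.g., Open-Domain). However, evaluations of these techniques have been limited in terms of base LLMs, dialogue types, and evaluation metrics. In this work, we extensively analyze different LLM adaptation techniques applied to different dialogue types. We selected two base LLMs, Llama-2 and Mistral, and four dialogue types: Open-Domain, Knowledge-Grounded, Task-Oriented, and Question Answering. We evaluate the performance of in-context learning and fine-tuning on the datasets selected for each dialogue type. We also assess the impact of incorporating external knowledge to ground the generation, both with Retrieval-Augmented Generation (RAG) and with gold knowledge. We adopt consistent evaluation and explainability criteria for the automatic metrics and the human evaluation protocols. Our analysis shows that there is no universally best technique for adapting large language models; the efficacy of each technique depends on both the base LLM and the specific dialogue type. Finally, the assessment of the best adaptation technique should include human evaluation to avoid false expectations and outcomes derived from automatic metrics.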

Abstract (Original)

We study the limitations of Large Language Models (LLMs) for the task of response generation in human-machine dialogue. Several techniques have been proposed in the literature for different dialogue types (e.g., Open-Domain). However, the evaluations of these techniques have been limited in terms of base LLMs, dialogue types and evaluation metrics. In this work, we extensively analyze different LLM adaptation techniques when applied to different dialogue types. We have selected two base LLMs, Llama-2 and Mistral, and four dialogue types Open-Domain, Knowledge-Grounded, Task-Oriented, and Question Answering. We evaluate the performance of in-context learning and fine-tuning techniques across datasets selected for each dialogue type. We assess the impact of incorporating external knowledge to ground the generation in both scenarios of Retrieval-Augmented Generation (RAG) and gold knowledge. We adopt consistent evaluation and explainability criteria for automatic metrics and human evaluation protocols. Our analysis shows that there is no universal best-technique for adapting large language models as the efficacy of each technique depends on both the base LLM and the specific type of dialogue. Last but not least, the assessment of the best adaptation technique should include human evaluation to avoid false expectations and outcomes derived from automatic metrics.

arXiv Information

Authors	Simone Alghisi, Massimo Rizzoli, Gabriel Roccabruna, Seyed Mahed Mousavi, Giuseppe Riccardi
Published	2024-07-05 11:47:31+00:00
arXiv page	arxiv_id(pdf)

Source and Services Used

arxiv.jp, DeepL
