FlashBack: Efficient Retrieval-Augmented Language Modeling for Long Context Inference

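Summary

Retrieval-Augmented Language Modeling (RALM), which integrates a large language model (LLM) with relevant documents from an external corpus, is a proven method for enabling the LLM to generate information beyond the scope of its pre-training corpus.
Previous work uses retrieved content by simply prepending it to the input, which incurs a high runtime cost: inference efficiency degrades because the Key-Value (KV) cache cannot be used efficiently.
This paper proposes \textsc{FlashBack}, a modular RALM designed to improve inference efficiency by using an appending context pattern, while maintaining decent performance after specific fine-tuning and without heavily damaging the knowledge integrity of the LLM.
Instead of prepending retrieved documents, \textsc{FlashBack} appends them at the end of the context so that the KV cache is used efficiently.
Our experiments show that the inference speed of \textsc{FlashBack} is up to $4\times$ faster than the prepending method on a 7B LLM (Llama 2).
By bypassing unnecessary re-computation, it achieves significantly faster inference, and this efficiency gain substantially reduces inference cost.
Our code will be made publicly available.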
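
The effect of prepending versus appending on KV cache reuse can be illustrated with a toy token count. This is only a sketch under simplifying assumptions, not the authors' implementation: it assumes causal attention, where a token's cached key/value entries remain valid only while every token before it is unchanged, and it ignores context growth during generation. The function names (`reusable_prefix_len`, `tokens_to_recompute`) and the token counts are hypothetical.

```python
def reusable_prefix_len(cached_tokens, new_tokens):
    """Count leading tokens shared by both sequences.

    Under causal attention, only this common prefix can reuse cached KV entries.
    """
    n = 0
    for old, new in zip(cached_tokens, new_tokens):
        if old != new:
            break
        n += 1
    return n


def tokens_to_recompute(context, old_doc, new_doc, prepend):
    """Tokens that must be re-encoded when the retrieved document changes."""
    if prepend:
        cached = old_doc + context
        current = new_doc + context
    else:
        cached = context + old_doc
        current = context + new_doc
    return len(current) - reusable_prefix_len(cached, current)


# Hypothetical sizes: a 1000-token context and 100-token retrieved documents.
context = list(range(1000))
old_doc = [-1] * 100
new_doc = [-2] * 100

prepend_cost = tokens_to_recompute(context, old_doc, new_doc, prepend=True)
append_cost = tokens_to_recompute(context, old_doc, new_doc, prepend=False)
print(prepend_cost, append_cost)
```

In this toy setting, swapping a prepended document invalidates the whole sequence, while swapping an appended document leaves the context's cache intact and only the new document has to be encoded. The actual speedup reported above depends on the model, context length, and retrieval frequency, none of which this sketch models.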

Abstract (original)

Retrieval-Augmented Language Modeling (RALM) by integrating large language models (LLM) with relevant documents from an external corpus is a proven method for enabling the LLM to generate information beyond the scope of its pre-training corpus. Previous work using utilizing retrieved content by simply prepending retrieved contents to the input poses a high runtime issue, which degrades the inference efficiency of the LLMs because they fail to use the Key-Value (KV) cache efficiently. In this paper, we propose \textsc{FlashBack}, a modular RALM designed to improve the inference efficiency of RALM with appending context pattern while maintaining decent performance after specific fine-tuning without heavily destruct the knowledge integrity of the LLM. \textsc{FlashBack} appends retrieved documents at the end of the context for efficiently utilizing the KV cache instead of prepending them. Our experiment shows that the inference speed of \textsc{FlashBack} is up to $4\times$ faster than the prepending method on a 7B LLM (Llama 2). Via bypassing unnecessary re-computation, it demonstrates an advancement by achieving significantly faster inference speed, and this heightened efficiency will substantially reduce inferential cost. Our code will be publicly available.

arXiv information

Authors	Runheng Liu, Xingchen Xiao, Heyan Huang, Zewen Chi, Zhijing Wu
Published	2024-05-15 16:42:35+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
