RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation

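Summary

Retrieval-augmented generation (RAG) combines the strengths of large language models (LLMs) and external knowledge databases, and has delivered significant improvements across a range of natural language processing tasks.
However, RAG introduces long sequences, which lead to high computation and memory costs.
We propose RAGCache, a novel multilevel dynamic caching system tailored for RAG.
Our analysis benchmarks current RAG systems and identifies the performance bottleneck (long sequences caused by knowledge injection) and the optimization opportunity (caching the intermediate states of the retrieved knowledge).
Based on these insights, we design RAGCache, which organizes the intermediate states of retrieved knowledge in a knowledge tree and caches them across the GPU and host memory hierarchy.
RAGCache uses a replacement policy that is aware of both LLM inference characteristics and RAG retrieval patterns.
It also dynamically overlaps the retrieval and inference steps to minimize end-to-end latency.
We implement RAGCache and evaluate it on vLLM, a state-of-the-art LLM inference system, and Faiss, a state-of-the-art vector database.
The experimental results show that, compared with vLLM integrated with Faiss, RAGCache reduces the time to first token (TTFT) by up to 4x and improves throughput by up to 2.1x.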

Abstract (original)

Retrieval-Augmented Generation (RAG) has shown significant improvements in various natural language processing tasks by integrating the strengths of large language models (LLMs) and external knowledge databases. However, RAG introduces long sequence generation and leads to high computation and memory costs. We propose RAGCache, a novel multilevel dynamic caching system tailored for RAG. Our analysis benchmarks current RAG systems, pinpointing the performance bottleneck (i.e., long sequence due to knowledge injection) and optimization opportunities (i.e., caching knowledge’s intermediate states). Based on these insights, we design RAGCache, which organizes the intermediate states of retrieved knowledge in a knowledge tree and caches them in the GPU and host memory hierarchy. RAGCache proposes a replacement policy that is aware of LLM inference characteristics and RAG retrieval patterns. It also dynamically overlaps the retrieval and inference steps to minimize the end-to-end latency. We implement RAGCache and evaluate it on vLLM, a state-of-the-art LLM inference system and Faiss, a state-of-the-art vector database. The experimental results show that RAGCache reduces the time to first token (TTFT) by up to 4x and improves the throughput by up to 2.1x compared to vLLM integrated with Faiss.
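
The abstract does not spell out how the knowledge tree is laid out, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the tree is a prefix tree keyed by the ordered IDs of the retrieved documents (order matters because a document's intermediate state depends on the tokens that precede it), that each node holds an opaque cached state, and that each node lives in either a "gpu" or a "host" tier. The names `KnowledgeTree`, `Node`, `gpu_capacity`, `longest_prefix`, and `insert` are hypothetical, and the demotion rule is plain least-recently-used over GPU leaves, used here as a stand-in for the paper's replacement policy.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Any, Dict, List, Optional

_clock = count()  # logical clock used only to order accesses for LRU


@dataclass
class Node:
    doc_id: Optional[str]
    state: Any = None      # placeholder for a cached intermediate state
    tier: str = "gpu"      # assumed tiers: "gpu" or "host"
    last_used: int = 0
    children: Dict[str, "Node"] = field(default_factory=dict)


class KnowledgeTree:
    """Hypothetical prefix tree over ordered document IDs."""

    def __init__(self, gpu_capacity: int):
        self.root = Node(doc_id=None)
        self.gpu_capacity = gpu_capacity  # max nodes kept in the GPU tier
        self.gpu_nodes = 0

    def longest_prefix(self, doc_ids: List[str]) -> List[Node]:
        """Return cached nodes matching the longest prefix of doc_ids."""
        node, hits = self.root, []
        for doc_id in doc_ids:
            child = node.children.get(doc_id)
            if child is None:
                break
            child.last_used = next(_clock)
            hits.append(child)
            node = child
        return hits

    def insert(self, doc_ids: List[str], states: List[Any]) -> None:
        """Cache one state per document along the path doc_ids."""
        node = self.root
        for doc_id, state in zip(doc_ids, states):
            child = node.children.get(doc_id)
            if child is None:
                child = Node(doc_id=doc_id, state=state)
                node.children[doc_id] = child
                self.gpu_nodes += 1  # new nodes start in the GPU tier
            child.last_used = next(_clock)
            node = child
        self._demote_until_fits()

    def _gpu_leaves(self) -> List[Node]:
        """GPU-tier nodes that have no GPU-tier children."""
        leaves, stack = [], [self.root]
        while stack:
            node = stack.pop()
            stack.extend(node.children.values())
            has_gpu_child = any(c.tier == "gpu" for c in node.children.values())
            if node is not self.root and node.tier == "gpu" and not has_gpu_child:
                leaves.append(node)
        return leaves

    def _demote_until_fits(self) -> None:
        """Move least-recently-used GPU leaves to the host tier (stand-in policy)."""
        while self.gpu_nodes > self.gpu_capacity:
            victim = min(self._gpu_leaves(), key=lambda n: n.last_used)
            victim.tier = "host"
            self.gpu_nodes -= 1
```

In this sketch, a request that retrieves documents d1, d2, d3 would call `longest_prefix(["d1", "d2", "d3"])` and only compute states for the documents after the last hit. Demoting leaves first keeps shared prefixes in the faster tier for longer; the sketch never promotes host nodes back to the GPU tier, which a real system would need to handle.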

arXiv information

Authors	Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin
Published	2024-04-25 06:47:57+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google