LongEmbed: Extending Embedding Models for Long Context Retrieval

Summary

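Embedding models play a central role in modern NLP applications such as IR and RAG.
Although the context limit of LLMs has been pushed beyond 1 million tokens, embedding models are still confined to a narrow context window of at most 8k tokens, which keeps them out of application scenarios that require long inputs, such as legal contracts.
This paper explores context window extension for existing embedding models, pushing the limit to 32k tokens without requiring additional training.
First, we examine how current embedding models perform on long context retrieval using our newly constructed LongEmbed benchmark.
LongEmbed comprises two synthetic tasks and four carefully chosen real-world tasks, featuring documents of varying length and dispersed target information.
The benchmarking results reveal substantial room for improvement in these models.
Building on this, comprehensive experiments show that training-free context window extension strategies such as position interpolation can effectively extend the context window of existing embedding models several-fold, regardless of whether their original context is 512 tokens or beyond 4k.
Furthermore, for models that employ absolute position encoding (APE), we show that further fine-tuning can yield notable performance gains while strictly preserving the original behavior on short inputs.
For models that use rotary position embedding (RoPE), RoPE-specific methods such as NTK and SelfExtend bring significant improvements, indicating that RoPE is superior to APE for context window extension.
To facilitate future research, we release E5-Base-4k and E5-RoPE-Base along with the LongEmbed benchmark.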
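
As a rough illustration of two of the training-free strategies named in the summary, the sketch below shows linear position interpolation applied to a learned APE table and the NTK-aware base rescaling for RoPE. It is not the authors' implementation: the function names, the toy table, and the choice to interpolate embedding vectors directly are assumptions made for illustration, and SelfExtend is not covered.

```python
import math


def interpolated_positions(target_len, original_len):
    """Squeeze target_len integer positions into the range [0, original_len)."""
    # Never stretch: inputs that already fit keep their integer positions.
    scale = max(1.0, target_len / original_len)
    return [i / scale for i in range(target_len)]


def interpolate_ape(table, target_len):
    """Linearly interpolate a learned position table to target_len rows.

    `table` is a list of embedding vectors, one per original position
    (a hypothetical stand-in for a model's position embedding matrix).
    """
    original_len = len(table)
    extended = []
    for pos in interpolated_positions(target_len, original_len):
        lo = int(math.floor(pos))
        hi = min(lo + 1, original_len - 1)  # clamp at the last trained position
        w = pos - lo
        extended.append([(1 - w) * a + w * b for a, b in zip(table[lo], table[hi])])
    return extended


def ntk_rope_inv_freq(dim, scale, base=10000.0):
    """RoPE inverse frequencies after NTK-aware rescaling of the base.

    The base is multiplied by scale ** (dim / (dim - 2)), so the highest
    frequency is unchanged and the lowest is divided by `scale`.
    """
    new_base = base * scale ** (dim / (dim - 2))
    return [new_base ** (-2 * i / dim) for i in range(dim // 2)]


# Toy example: a 4-position table with 1-dimensional embeddings, extended to 8.
toy_table = [[0.0], [1.0], [2.0], [3.0]]
extended_table = interpolate_ape(toy_table, 8)
inv_freq = ntk_rope_inv_freq(dim=8, scale=4.0)
```

In this toy setup every second row of `extended_table` coincides with a row of the original table, and the rows in between are midpoints of their neighbors. Note that simply replacing the table changes the positions seen by short inputs too; preserving short-input behavior, as the summary describes for the fine-tuned APE models, would need extra handling that is not shown here.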

Summary (original)

Embedding models play a pivotal role in modern NLP applications such as IR and RAG. While the context limit of LLMs has been pushed beyond 1 million tokens, embedding models are still confined to a narrow context window not exceeding 8k tokens, and are thus kept from application scenarios requiring long inputs such as legal contracts. This paper explores context window extension of existing embedding models, pushing the limit to 32k without requiring additional training. First, we examine the performance of current embedding models for long context retrieval on our newly constructed LongEmbed benchmark. LongEmbed comprises two synthetic tasks and four carefully chosen real-world tasks, featuring documents of varying length and dispersed target information. Benchmarking results underscore huge room for improvement in these models. Based on this, comprehensive experiments show that training-free context window extension strategies like position interpolation can effectively extend the context window of existing embedding models several fold, regardless of their original context being 512 or beyond 4k. Furthermore, for models employing absolute position encoding (APE), we show the possibility of further fine-tuning to harvest notable performance gains while strictly preserving original behavior for short inputs. For models using rotary position embedding (RoPE), significant enhancements are observed when employing RoPE-specific methods, such as NTK and SelfExtend, indicating RoPE’s superiority over APE for context window extension. To facilitate future research, we release E5-Base-4k and E5-RoPE-Base, along with the LongEmbed benchmark.

arXiv information

Authors	Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li
Published	2024-04-18 11:29:23+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
