A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens

Summary

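Text embeddings from large language models (LLMs) have achieved excellent results on tasks such as information retrieval and semantic textual similarity. This work reports an interesting finding: when a text is fed into an LLM-based embedder, the resulting text embedding can be aligned with the key tokens in the input text.
We first analyze this phenomenon thoroughly on eight LLM-based embedders and show that it is universal and unaffected by model architecture, training strategy, or embedding method.
A deeper analysis shows that the main change in the embedding space between these embedders and their LLM backbones lies in the first principal component.
By adjusting the first principal component, text embeddings can be aligned with the key tokens.
Finally, we give several examples to demonstrate the broad application potential of this finding. (1) We propose a simple and practical sparse retrieval method based on the aligned tokens, which achieves 80% of the dense retrieval performance of the same model while significantly reducing computation.
(2) We show that our findings offer a new perspective for understanding new techniques in this field (e.g., instruction-following embedding) and fuzzy concepts (e.g., semantic relatedness vs. similarity).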

Abstract (original)

Text embeddings from large language models (LLMs) have achieved excellent results in tasks such as information retrieval, semantic textual similarity, etc. In this work, we show an interesting finding: when feeding a text into the LLM-based embedder, the obtained text embedding will be able to be aligned with the key tokens in the input text. We first fully analyze this phenomenon on eight LLM-based embedders and show that this phenomenon is universal and is not affected by model architecture, training strategy, and embedding method. With a deeper analysis, we find that the main change in embedding space between these embedders and their LLM backbones is in the first principal component. By adjusting the first principal component, we can align text embedding with the key tokens. Finally, we give several examples to demonstrate the vast application potential of this finding: (1) we propose a simple and practical sparse retrieval method based on the aligned tokens, which can achieve 80% of the dense retrieval effect of the same model while reducing the computation significantly; (2) we show that our findings provide a novel perspective to help understand novel technologies (e.g., instruction-following embedding) and fuzzy concepts (e.g., semantic relatedness vs. similarity) in this field.
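
The following is a minimal sketch of the idea in the abstract, not the authors' implementation. It assumes that "aligning" an embedding with tokens means scoring it against a matrix of per-token vectors (for example, an LLM's output projection) and keeping the highest-scoring tokens, and that sparse retrieval then compares those token sets. The abstract specifies neither step, so the function names, the toy vocabulary, and all vectors below are hypothetical.

```python
import numpy as np


def top_k_aligned_tokens(embedding, token_matrix, k=3):
    """Return ids of the k tokens whose vectors score highest against the embedding."""
    # One dot-product score per vocabulary token (assumed alignment measure).
    scores = token_matrix @ embedding
    # Sort by descending score and keep the first k token ids.
    return [int(i) for i in np.argsort(-scores)[:k]]


def sparse_overlap_score(query_ids, doc_ids):
    """Score a document by how many aligned tokens it shares with the query."""
    return len(set(query_ids) & set(doc_ids))


# Toy data (hypothetical): a 6-token vocabulary with one-hot token vectors.
vocab = ["cat", "dog", "pet", "car", "road", "engine"]
token_matrix = np.eye(len(vocab))

query_emb = np.array([0.9, 0.1, 0.8, 0.0, 0.0, 0.0])
doc_a_emb = np.array([0.7, 0.2, 0.9, 0.0, 0.1, 0.0])
doc_b_emb = np.array([0.0, 0.0, 0.1, 0.9, 0.3, 0.8])

query_ids = top_k_aligned_tokens(query_emb, token_matrix, k=2)
doc_a_ids = top_k_aligned_tokens(doc_a_emb, token_matrix, k=2)
doc_b_ids = top_k_aligned_tokens(doc_b_emb, token_matrix, k=2)

print([vocab[i] for i in query_ids])
print(sparse_overlap_score(query_ids, doc_a_ids))
print(sparse_overlap_score(query_ids, doc_b_ids))
```

In this toy setup each text is reduced to a handful of token ids, so documents can be compared through set overlap (or a standard inverted index) instead of dense vector similarity, which is one plausible reading of where the computational saving comes from.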

arXiv information

Authors	Zhijie Nie, Richong Zhang, Zhanyu Wu
Published	2024-12-27 05:56:37+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
