BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

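Summary

This paper introduces M3-Embedding, a new embedding model distinguished by its versatility in multi-linguality, multi-functionality, and multi-granularity.
It supports more than 100 working languages and achieves new state-of-the-art performance on multilingual and cross-lingual retrieval tasks.
It can perform the three common retrieval functions of an embedding model at the same time (dense retrieval, multi-vector retrieval, and sparse retrieval), which gives real-world IR applications a unified model foundation.
It handles inputs of different granularities, from short sentences to long documents of up to 8192 tokens.
Training M3-Embedding effectively involves the following technical contributions.
The authors propose a novel self-knowledge distillation approach in which the relevance scores from the different retrieval functions are integrated into a teacher signal that improves training quality.
They also optimize the batching strategy to allow a large batch size and high training throughput, which helps keep the embeddings discriminative.
To the authors' knowledge, M3-Embedding is the first embedding model to achieve this degree of versatility.
The model and code will be publicly available at https://github.com/FlagOpen/FlagEmbedding.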
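
The abstract does not give formulas, so the following is only a minimal sketch of how three relevance scores could be combined into a teacher signal for self-knowledge distillation. The scoring definitions are common textbook choices (inner product, weighted term overlap, late interaction), and every function name, the equal weights, and the soft-label cross-entropy are assumptions made for illustration. None of this is the FlagEmbedding API or the authors' exact training objective.

```python
import math


def dense_score(q_vec, p_vec):
    """Inner product of two single-vector embeddings (assumed pre-normalized)."""
    return sum(a * b for a, b in zip(q_vec, p_vec))


def sparse_score(q_weights, p_weights):
    """Sum of weight products over terms present in both query and passage."""
    return sum(w * p_weights[t] for t, w in q_weights.items() if t in p_weights)


def multi_vector_score(q_vecs, p_vecs):
    """Late interaction: mean over query tokens of the best passage-token match."""
    best = [max(dense_score(q, p) for p in p_vecs) for q in q_vecs]
    return sum(best) / len(best)


def integrated_scores(dense, sparse, multi, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three score lists; equal weights are an assumption."""
    w1, w2, w3 = weights
    return [w1 * d + w2 * s + w3 * m for d, s, m in zip(dense, sparse, multi)]


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_scores, teacher_scores):
    """Cross-entropy of the student distribution against teacher soft labels."""
    teacher = softmax(teacher_scores)
    student = softmax(student_scores)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


if __name__ == "__main__":
    # Hypothetical scores for one query against three candidate passages.
    dense = [0.9, 0.2, 0.1]
    sparse = [0.5, 0.4, 0.0]
    multi = [0.8, 0.3, 0.2]
    teacher = integrated_scores(dense, sparse, multi)
    # Each retrieval mode acts as a student of the combined score.
    for name, student in (("dense", dense), ("sparse", sparse), ("multi", multi)):
        print(name, distillation_loss(student, teacher))
```

In this sketch the model's own combined score plays the role of the teacher, so no separate teacher model is needed; the loss for each retrieval mode is smallest when its ranking distribution matches the combined one.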

Abstract (original)

In this paper, we present a new embedding model, called M3-Embedding, which is distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It can support more than 100 working languages, leading to new state-of-the-art performances on multi-lingual and cross-lingual retrieval tasks. It can simultaneously perform the three common retrieval functionalities of embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval, which provides a unified model foundation for real-world IR applications. It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens. The effective training of M3-Embedding involves the following technical contributions. We propose a novel self-knowledge distillation approach, where the relevance scores from different retrieval functionalities can be integrated as the teacher signal to enhance the training quality. We also optimize the batching strategy, enabling a large batch size and high training throughput to ensure the discriminativeness of embeddings. To the best of our knowledge, M3-Embedding is the first embedding model which realizes such a strong versatility. The model and code will be publicly available at https://github.com/FlagOpen/FlagEmbedding.

arXiv information

Authors	Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, Zheng Liu
Published	2024-06-28 09:55:49+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
