Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models

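Summary

Jina Embeddings is a set of high-performance sentence embedding models that convert a wide range of text inputs into numerical representations capturing the semantic essence of the text.
Although the models are not designed exclusively for text generation, they excel in applications such as dense retrieval and semantic textual similarity.
The paper details the development of Jina Embeddings, beginning with the creation of high-quality pairwise and triplet datasets.
It highlights the crucial role of data cleaning in dataset preparation, offers in-depth insight into the model training process, and closes with a comprehensive performance evaluation on the Massive Textual Embedding Benchmark (MTEB).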

Summary (original)

Jina Embeddings constitutes a set of high-performance sentence embedding models adept at translating various textual inputs into numerical representations, thereby capturing the semantic essence of the text. While these models are not exclusively designed for text generation, they excel in applications such as dense retrieval and semantic textual similarity. This paper details the development of Jina Embeddings, starting with the creation of a high-quality pairwise and triplet dataset. It underlines the crucial role of data cleaning in dataset preparation, gives in-depth insights into the model training process, and concludes with a comprehensive performance evaluation using the Massive Textual Embedding Benchmark (MTEB).
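
The abstract names dense retrieval and semantic textual similarity as target applications. As a rough illustration of how such applications typically use sentence embeddings, the sketch below scores vectors with cosine similarity and ranks documents against a query. It is not taken from the paper: the 3-dimensional vectors are made-up stand-ins for real model output, and the helper names cosine_similarity and rank_documents are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy embeddings (assumption): a real embedding model would
# produce vectors with hundreds of dimensions.
query = [1.0, 0.0, 1.0]
docs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.1, 0.9],  # nearly parallel to the query
    [0.5, 0.5, 0.5],  # partially aligned with the query
]

ranking = rank_documents(query, docs)
print(ranking)
```

In a retrieval setting the document vectors would usually be computed once and stored, so that only the query needs to be embedded at search time.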

arXiv information

Authors	Michael Günther, Louis Milliken, Jonathan Geuter, Georgios Mastrapas, Bo Wang, Han Xiao
Published	2023-07-20 20:37:24+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
