LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models

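Summary

Recent advances in embedding models based on large language models (LLMs) have set new state-of-the-art results on text embedding tasks, particularly in dense vector-based retrieval.
However, these models focus mainly on English, and their multilingual embedding capabilities remain largely unexplored.
To address this limitation, the authors present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models to multilingual tasks without requiring multilingual supervision.
LUSIFER's architecture combines a multilingual encoder, which serves as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks.
These components are integrated through a minimal set of trainable parameters that act as a connector, transferring the multilingual encoder's language understanding capabilities to the specialized embedding model.
In addition, to evaluate multilingual embedding performance comprehensively, the authors introduce a new benchmark covering 5 primary embedding tasks, 123 diverse datasets, and 14 languages.
Extensive experiments show that LUSIFER significantly improves multilingual performance across a range of embedding tasks, particularly for medium- and low-resource languages, without requiring explicit multilingual training data.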

Abstract (original)

Recent advancements in large language models (LLMs) based embedding models have established new state-of-the-art benchmarks for text embedding tasks, particularly in dense vector-based retrieval. However, these models predominantly focus on English, leaving multilingual embedding capabilities largely unexplored. To address this limitation, we present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision. LUSIFER’s architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks. These components are seamlessly integrated through a minimal set of trainable parameters that act as a connector, effectively transferring the multilingual encoder’s language understanding capabilities to the specialized embedding model. Additionally, to comprehensively evaluate multilingual embedding performance, we introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages. Extensive experimental results demonstrate that LUSIFER significantly enhances the multilingual performance across various embedding tasks, particularly for medium and low-resource languages, without requiring explicit multilingual training data.
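
The data flow the abstract describes (multilingual encoder, then a small trainable connector, then an embedding model) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the connector's form, so a single linear projection is assumed here, the encoder output is represented by hand-written token vectors, and mean pooling with L2 normalization stands in for the LLM-based embedding model. All function names are hypothetical.

```python
import math
import random


def make_connector(d_in, d_out, seed=0):
    """Hypothetical connector: one linear map (d_in -> d_out).

    Assumption: in this sketch it is the only component with trainable weights.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_in)
    return [[rng.uniform(-scale, scale) for _ in range(d_in)] for _ in range(d_out)]


def apply_connector(weights, token_states):
    """Project each encoder token state into the embedding model's input space."""
    return [
        [sum(w * x for w, x in zip(row, state)) for row in weights]
        for state in token_states
    ]


def mean_pool_normalize(token_vectors):
    """Stand-in for the embedding model: mean-pool tokens, then L2-normalize."""
    dim = len(token_vectors[0])
    pooled = [sum(vec[i] for vec in token_vectors) / len(token_vectors) for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in pooled)) or 1.0
    return [v / norm for v in pooled]


def embed(token_states, connector):
    """Encoder states -> connector -> pooled, unit-length sentence embedding."""
    return mean_pool_normalize(apply_connector(connector, token_states))


# Toy encoder output: 2 tokens, each a 4-dimensional hidden state (made-up values).
encoder_states = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2]]
connector = make_connector(d_in=4, d_out=6)
embedding = embed(encoder_states, connector)
```

Because the resulting vectors have unit length, the dot product of two such embeddings equals their cosine similarity, which is the usual scoring function in dense retrieval.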

arXiv information

Authors	Hieu Man, Nghia Trung Ngo, Viet Dac Lai, Ryan A. Rossi, Franck Dernoncourt, Thien Huu Nguyen
Published	2025-05-05 05:01:36+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
