2D Matryoshka Training for Information Retrieval

Summary

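2D Matryoshka Training is an embedding training approach that trains a single encoder model simultaneously across many layer-dimension setups.
When embeddings are taken from sub-layers, the method has shown higher effectiveness than traditional training approaches on Semantic Text Similarity (STS) tasks.
Despite this success, the two published implementations differ from each other, which leads to different comparative results against baseline models.
This reproducibility study implements and evaluates both versions of 2D Matryoshka Training on STS tasks and extends the analysis to retrieval tasks.
The findings show that both versions are more effective than traditional Matryoshka training on sub-dimensions and than traditional full-size model training, but neither outperforms models trained separately for a specific sub-layer and sub-dimension setup.
These results also generalize well to retrieval tasks, in both the supervised (MSMARCO) and zero-shot (BEIR) settings.
Further exploration of different loss computations reveals implementations better suited to retrieval, such as incorporating a full-dimension loss and training on a broader range of target dimensions.
Conversely, some intuitive approaches, such as fixing the document encoder to the full model output, yield no improvement.
The reproduction code is available at https://github.com/ielab/2DMSE-Reproduce.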

Abstract (original)

2D Matryoshka Training is an advanced embedding representation training approach designed to train an encoder model simultaneously across various layer-dimension setups. This method has demonstrated higher effectiveness in Semantic Text Similarity (STS) tasks over traditional training approaches when using sub-layers for embeddings. Despite its success, discrepancies exist between two published implementations, leading to varied comparative results with baseline models. In this reproducibility study, we implement and evaluate both versions of 2D Matryoshka Training on STS tasks and extend our analysis to retrieval tasks. Our findings indicate that while both versions achieve higher effectiveness than traditional Matryoshka training on sub-dimensions and traditional full-sized model training approaches, they do not outperform models trained separately on specific sub-layer and sub-dimension setups. Moreover, these results generalize well to retrieval tasks, both in supervised (MSMARCO) and zero-shot (BEIR) settings. Further explorations of different loss computations reveal more suitable implementations for retrieval tasks, such as incorporating full-dimension loss and training on a broader range of target dimensions. Conversely, some intuitive approaches, such as fixing document encoders to full model outputs, do not yield improvements. Our reproduction code is available at https://github.com/ielab/2DMSE-Reproduce.
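
To make the idea of training "across various layer-dimension setups" concrete, the following is a minimal sketch rather than either published implementation. It assumes a squared-error objective on cosine similarity and a uniform average over all (layer, dimension) pairs; the function names, the toy embeddings, and the weighting are all hypothetical choices made for illustration.

```python
import math
from itertools import product


def truncated_cosine(u, v, dim):
    """Cosine similarity using only the first `dim` components."""
    u, v = u[:dim], v[:dim]
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def two_d_matryoshka_loss(layers_a, layers_b, target, layer_ids, dims):
    """Average squared error over every (layer, dimension) setup.

    layers_a / layers_b hold one embedding per encoder layer for each of
    the two texts (toy inputs here; a real encoder would produce them).
    Uniform averaging is an assumption, not the published weighting.
    """
    setups = list(product(layer_ids, dims))
    total = 0.0
    for layer, dim in setups:
        sim = truncated_cosine(layers_a[layer], layers_b[layer], dim)
        total += (sim - target) ** 2
    return total / len(setups)


# Toy example: 2 layers, 4-dimensional embeddings, target similarity 1.0.
layers_a = [[1.0, 0.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
layers_b = [[1.0, 0.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
loss = two_d_matryoshka_loss(layers_a, layers_b, target=1.0,
                             layer_ids=[0, 1], dims=[2, 4])
```

In this sketch, adding the full embedding size to `dims` is one simple way to include a full-dimension term in the loss, and lengthening `dims` corresponds to training on a broader range of target dimensions.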

arXiv information

Authors	Shuai Wang, Shengyao Zhuang, Bevan Koopman, Guido Zuccon
Published	2024-11-26 10:47:35+00:00
arXiv site	arxiv_id (pdf)

Sources and services used

arxiv.jp, Google
