M3-JEPA: Multimodal Alignment via Multi-gate MoE based on the Joint-Embedding Predictive Architecture

Summary

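Current multimodal learning strategies primarily optimize in the original token space.
Such frameworks are easy to integrate with a pretrained language model backbone, but they can suffer from modality collapse.
To mitigate this issue, we apply the Joint-Embedding Predictive Architecture (JEPA) to multimodal tasks: a predictor maps the input embedding into the output embedding space, and cross-modal alignment is then performed in the latent space.
We implement this predictor as a Multi-Gate Mixture of Experts (MMoE) and accordingly name the framework M3-JEPA.
The gating function disentangles modality-specific information from shared information and derives information-theoretic optimality.
The framework is trained with both a contrastive loss and a regularization loss, and is solved by alternating gradient descent (AGD) across different multimodal tasks.
Through thoroughly designed experiments, we show that M3-JEPA achieves state-of-the-art performance across a range of modalities and tasks, generalizes to unseen datasets and domains, and is computationally efficient in both training and inference.
Our observations suggest that M3-JEPA could become a new foundation for self-supervised learning in the open world.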
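
As an illustration of the multi-gate predictor idea, the following is a minimal sketch and not the authors' implementation. It assumes linear experts, one softmax gate per task, and randomly initialized weights; the class name `MultiGateMoE`, the task names, and all dimensions are hypothetical and chosen only for the example.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random matrix; a stand-in for learned parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class MultiGateMoE:
    """Hypothetical predictor: experts are shared, each task has its own gate."""

    def __init__(self, dim, num_experts, tasks, seed=0):
        rng = random.Random(seed)
        # Experts shared by every task (assumed linear for brevity).
        self.experts = [rand_matrix(dim, dim, rng) for _ in range(num_experts)]
        # One gating matrix per task, e.g. per alignment direction.
        self.gates = {t: rand_matrix(num_experts, dim, rng) for t in tasks}

    def gate_weights(self, x, task):
        """Task-specific mixture weights over the shared experts."""
        return softmax(matvec(self.gates[task], x))

    def predict(self, x, task):
        """Map an input embedding to a predicted output embedding."""
        weights = self.gate_weights(x, task)
        outputs = [matvec(W, x) for W in self.experts]
        return [
            sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(x))
        ]


moe = MultiGateMoE(dim=4, num_experts=3, tasks=["image_to_text", "text_to_image"])
embedding = [1.0, -0.5, 0.25, 2.0]
predicted = moe.predict(embedding, "image_to_text")
```

Sharing the experts while keeping a separate gate per task is what lets each task weight the same experts differently; how the gates separate modality-specific from shared information in M3-JEPA itself is described in the paper, not in this sketch.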

Summary (original)

Current multimodal learning strategies primarily optimize in the original token space. Such a framework is easy to incorporate with the backbone of pretrained language model, but might result in modality collapse. To alleviate such issues, we leverage the Joint-Embedding Predictive Architecture (JEPA) on the multimodal tasks, which converts the input embedding into the output embedding space by a predictor and then conducts the cross-modal alignment on the latent space. We implement this predictor by a Multi-Gate Mixture of Experts (MMoE) and name the framework as M3-JEPA, accordingly. The gating function disentangles the modality-specific and shared information and derives information-theoretic optimality. The framework is implemented with both contrastive and regularization loss, and solved by alternative gradient descent (AGD) between different multimodal tasks. By thoroughly designed experiments, we show that M3-JEPA can obtain state-of-the-art performance on different modalities and tasks, generalize to unseen datasets and domains, and is computationally efficient in both training and inference. Our observation suggests that M3-JEPA might become a new basis to self-supervised learning in the open world.
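
The abstract mentions a contrastive loss on the latent space and alternation between tasks without giving their exact form. The sketch below is one common choice, a symmetric InfoNCE loss over cosine similarities, together with a simple round-robin task schedule. Both are assumptions made for illustration: the function names, the temperature value, and the round-robin order are hypothetical and are not taken from the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0.0 else [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax_at(row, idx):
    """Log-probability of entry idx under a softmax over row."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return row[idx] - lse


def info_nce(predicted, targets, temperature=0.1):
    """Symmetric InfoNCE: the i-th prediction should match the i-th target."""
    p = [normalize(v) for v in predicted]
    t = [normalize(v) for v in targets]
    n = len(p)
    sims = [[dot(p[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    # Prediction -> target direction (rows of the similarity matrix).
    forward = -sum(log_softmax_at(sims[i], i) for i in range(n)) / n
    # Target -> prediction direction (columns of the similarity matrix).
    cols = [[sims[i][j] for i in range(n)] for j in range(n)]
    backward = -sum(log_softmax_at(cols[j], j) for j in range(n)) / n
    return 0.5 * (forward + backward)


def alternating_schedule(tasks, num_steps):
    """Assumed round-robin order in which tasks receive a gradient step."""
    return [tasks[step % len(tasks)] for step in range(num_steps)]


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = info_nce(aligned, aligned)   # matching pairs: small loss
loss_swapped = info_nce(aligned, swapped)   # mismatched pairs: large loss
schedule = alternating_schedule(["image_to_text", "text_to_image"], 4)
```

In a training loop, each entry of the schedule would select which task's batch and gate are used for that gradient step; the regularization term mentioned in the abstract is omitted here because its form is not stated.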

arXiv information

Authors	Hongyang Lei, Xiaolong Cheng, Qi Qin, Dan Wang, Kun Fan, Huazhen Huang, Qingqing Gu, Yetao Wu, Zhonglin Jiang, Yong Chen, Luo Ji
Published	2025-06-18 14:45:27+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
