M3-Jepa: Multimodal Alignment via Multi-directional MoE based on the JEPA framework

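Summary

Current multimodal alignment strategies mainly use a single or unified modality encoder and optimize alignment in the original token space. This framework is easy to implement and can incorporate pretrained knowledge, but it may introduce information bias. To address this problem, JEPA (joint encoding predictive architecture) learns the alignment loss in the latent space, using a predictor that maps the input encoding into the output latent space. However, applications of JEPA to multimodal scenarios have been limited. This paper introduces M3-Jepa, a scalable multimodal alignment framework whose predictor is implemented as a multi-directional mixture of experts (MoE). Using an information-theoretic derivation, the authors show that the framework can maximize mutual information by alternately optimizing different uni-directional tasks. Carefully designed experiments show that M3-Jepa achieves state-of-the-art performance across modalities and tasks, generalizes to unseen datasets and domains, and is computationally efficient in both training and inference. The study suggests that M3-Jepa may offer a new paradigm for self-supervised learning and open-world modeling.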

Summary (original)

Current multimodal alignment strategies primarily use single or unified modality encoders, while optimizing the alignment on the original token space. Such a framework is easy to implement and incorporate with the pretrained knowledge, but might result in information bias. To deal with such issues, the joint encoding predictive architecture (JEPA) learns the alignment loss on the latent space, with a predictor to convert the input encoding to the output latent space. However, the application of JEPA in multimodal scenarios is limited so far. In this paper, we introduce M3-Jepa, a scalable multimodal alignment framework, with the predictor implemented by a multi-directional mixture of experts (MoE). We demonstrate the framework can maximize the mutual information with information theory derivations, by alternating the optimization between different uni-directional tasks. By thoroughly designed experiments, we show that M3-Jepa can obtain state-of-the-art performance on different modalities and tasks, generalize to unseen datasets and domains, and is computationally efficient in training and inference. Our study indicates that M3-Jepa might provide a new paradigm to self-supervised learning and open-world modeling.
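The mechanism described in the abstract (a predictor that maps one modality's encoding into another modality's latent space, implemented as a mixture of experts and optimized one direction at a time) can be illustrated with a minimal forward-pass sketch. Everything below is an assumption made for illustration and is not taken from the paper: the class `MoEPredictor`, the per-direction gate bias, the direction names, the linear experts, and the InfoNCE-style contrastive loss are all hypothetical stand-ins. The abstract only states that the alignment loss lives in the latent space and that uni-directional tasks are alternated. No gradient updates are performed here.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


class MoEPredictor:
    """Hypothetical predictor: soft mixture of linear experts with a direction-aware gate."""

    def __init__(self, dim, n_experts, directions, seed=0):
        rng = random.Random(seed)
        rand = lambda r, c: [[rng.gauss(0, 0.1) for _ in range(c)] for _ in range(r)]
        self.experts = [rand(dim, dim) for _ in range(n_experts)]
        self.gate = rand(n_experts, dim)
        # Assumption: each direction shifts the gate logits by its own bias vector.
        self.direction_bias = {
            d: [rng.gauss(0, 1.0) for _ in range(n_experts)] for d in directions
        }

    def gate_weights(self, x, direction):
        logits = matvec(self.gate, x)
        bias = self.direction_bias[direction]
        return softmax([l + b for l, b in zip(logits, bias)])

    def __call__(self, x, direction):
        w = self.gate_weights(x, direction)
        outs = [matvec(E, x) for E in self.experts]
        # Weighted sum of expert outputs, one value per latent dimension.
        return [sum(wk * o[i] for wk, o in zip(w, outs)) for i in range(len(x))]


def info_nce(pred, target, temperature=0.1):
    """InfoNCE-style loss in latent space: row i of pred should match row i of target."""
    P = [normalize(p) for p in pred]
    T = [normalize(t) for t in target]
    loss = 0.0
    for i, p in enumerate(P):
        sims = [sum(a * b for a, b in zip(p, t)) / temperature for t in T]
        loss -= math.log(softmax(sims)[i])
    return loss / len(P)


# Toy latents standing in for the outputs of two modality encoders.
rng = random.Random(1)
dim, batch = 8, 4
image_latents = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(batch)]
text_latents = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(batch)]

directions = ["image->text", "text->image"]
predictor = MoEPredictor(dim, n_experts=3, directions=directions)

# Alternate between the two uni-directional tasks, one direction per step.
losses = {}
for step in range(4):
    direction = directions[step % 2]
    if direction == "image->text":
        src, tgt = image_latents, text_latents
    else:
        src, tgt = text_latents, image_latents
    pred = [predictor(x, direction) for x in src]
    losses[direction] = info_nce(pred, tgt)
```

In a real training loop, each step would also backpropagate the loss for the current direction into the predictor (and possibly lightweight adapters on the encoders); the sketch omits this to stay dependency-free. The single shared predictor with a direction-dependent gate is one plausible reading of "multi-directional", chosen here because it lets both directions reuse the same experts.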

arXiv information

Authors	Hongyang Lei, Xiaolong Cheng, Dan Wang, Kun Fan, Qi Qin, Huazhen Huang, Yetao Wu, Qingqing Gu, Zhonglin Jiang, Yong Chen, Luo Ji
Published	2025-05-05 16:48:19+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
