Multimodal Understanding Through Correlation Maximization and Minimization

Summary

Title: Multimodal Understanding Through Correlation Maximization and Minimization

Summary:
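– Multimodal learning has mainly focused on training large models on different modalities and fusing their feature representations to improve performance on downstream tasks.
– This work instead studies the intrinsic nature of multimodal data and aims to answer the following questions:
1) Can we learn more structured latent representations of general multimodal data?
2) Can we intuitively understand, both mathematically and visually, what the learned latent representations capture?
– To answer 1), the authors propose MUCMM, a general and lightweight framework that can be incorporated into any large pre-trained network and learns both common and individual representations. The common representations capture what the modalities share; the individual representations capture the unique aspects of each modality.
– To answer 2), the authors propose novel scores that summarize the learned common and individual structures, and visualize the gradients of these scores with respect to the input to show what the different representations capture. They also provide mathematical intuition for the computed gradients in a linear setting and demonstrate the effectiveness of the approach through a variety of experiments.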
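The summary does not spell out MUCMM's objectives or the exact form of its scores. As a rough, hypothetical illustration of the two ideas in a linear setting, the sketch below uses classical linear CCA (a swapped-in technique, not the authors' method) on synthetic two-modality data: the canonical pair with the largest correlation plays the role of a "common" direction, the pair with the smallest correlation plays the role of an "individual" one, and a per-sample product score shows that, in the linear case, the input gradient points along the learned weight vector. The functions `cca`, `common_score`, `common_score_grad_x1`, the data generator, and all parameter values are assumptions introduced for illustration; only NumPy is required.

```python
import numpy as np


def cca(X1, X2, reg=1e-6):
    """Linear CCA: weight matrices A, B (one column per canonical pair) and the
    canonical correlations, in decreasing order."""
    X1 = X1 - X1.mean(axis=0)
    X2 = X2 - X2.mean(axis=0)
    n = X1.shape[0]
    C11 = X1.T @ X1 / (n - 1) + reg * np.eye(X1.shape[1])
    C22 = X2.T @ X2 / (n - 1) + reg * np.eye(X2.shape[1])
    C12 = X1.T @ X2 / (n - 1)
    # Whitening matrices with W.T @ C @ W = I, built from inverse Cholesky factors
    W1 = np.linalg.inv(np.linalg.cholesky(C11)).T
    W2 = np.linalg.inv(np.linalg.cholesky(C22)).T
    # Singular values of the whitened cross-covariance are the canonical correlations
    U, s, Vt = np.linalg.svd(W1.T @ C12 @ W2, full_matrices=False)
    return W1 @ U, W2 @ Vt.T, s


def common_score(x1, x2, mu1, mu2, a, b):
    """Hypothetical per-sample score: product of the two centered projections."""
    return float(((x1 - mu1) @ a) * ((x2 - mu2) @ b))


def common_score_grad_x1(x1, x2, mu1, mu2, a, b):
    """Analytic gradient of common_score w.r.t. x1. In this linear setting it is
    the weight vector a scaled by the other modality's projection."""
    return a * ((x2 - mu2) @ b)


# Hypothetical synthetic data: one latent shared by both modalities,
# plus one private latent per modality, each observed with small noise.
rng = np.random.default_rng(0)
n = 2000


def noisy(z):
    return z + 0.1 * rng.normal(size=n)


shared = rng.normal(size=n)
private1 = rng.normal(size=n)
private2 = rng.normal(size=n)
X1 = np.column_stack([noisy(shared), noisy(private1)])
X2 = np.column_stack([noisy(shared), noisy(private2)])

A, B, corrs = cca(X1, X2)
a_common, b_common = A[:, 0], B[:, 0]  # direction pair with maximal correlation
a_indiv = A[:, -1]                     # modality-1 direction with minimal correlation
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
grad = common_score_grad_x1(X1[0], X2[0], mu1, mu2, a_common, b_common)

print("canonical correlations:", np.round(corrs, 3))
print("common weights (modality 1):", np.round(a_common, 3))
print("individual weights (modality 1):", np.round(a_indiv, 3))
```

On this toy data the top canonical correlation comes out close to 1 and the bottom one close to 0, with the common weights concentrated on the shared feature and the individual weights on the private feature. With a nonlinear encoder the gradient would no longer be a fixed weight vector and would typically be computed with automatic differentiation.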

Summary (original)

Multimodal learning has mainly focused on learning large models on, and fusing feature representations from, different modalities for better performances on downstream tasks. In this work, we take a detour from this trend and study the intrinsic nature of multimodal data by asking the following questions: 1) Can we learn more structured latent representations of general multimodal data?; and 2) can we intuitively understand, both mathematically and visually, what the latent representations capture? To answer 1), we propose a general and lightweight framework, Multimodal Understanding Through Correlation Maximization and Minimization (MUCMM), that can be incorporated into any large pre-trained network. MUCMM learns both the common and individual representations. The common representations capture what is common between the modalities; the individual representations capture the unique aspect of the modalities. To answer 2), we propose novel scores that summarize the learned common and individual structures and visualize the score gradients with respect to the input, visually discerning what the different representations capture. We further provide mathematical intuitions of the computed gradients in a linear setting, and demonstrate the effectiveness of our approach through a variety of experiments.

arXiv information

Authors	Yifeng Shi, Marc Niethammer
Published	2023-05-04 19:53:05+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, OpenAI
