Generalized Product-of-Experts for Learning Multimodal Representations in Noisy Environments

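Summary

Real-world applications and settings involve interactions between different modalities (e.g., video, speech, text). To process multimodal information automatically and use it in an end application, multimodal representation learning (MRL) has recently emerged as an active research area. MRL learns reliable and robust representations of information from heterogeneous sources and fuses them. In practice, however, data acquired from different sources are typically noisy. In extreme cases, large noise can completely alter the semantics of the data and introduce inconsistencies into parallel multimodal data. This paper proposes a new method for multimodal representation learning in noisy environments based on the generalized product-of-experts technique. The proposed method trains a separate network for each modality to assess the credibility of the information coming from that modality, and then estimates the joint distribution while dynamically varying the contribution of each modality. The method is evaluated on two challenging benchmarks from two different domains: multimodal 3D hand-pose estimation and multimodal surgical video segmentation. It achieves state-of-the-art performance on both benchmarks. Quantitative and qualitative evaluations further show its advantages over previous approaches.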

Summary (original)

A real-world application or setting involves interaction between different modalities (e.g., video, speech, text). In order to process the multimodal information automatically and use it for an end application, Multimodal Representation Learning (MRL) has emerged as an active area of research in recent times. MRL involves learning reliable and robust representations of information from heterogeneous sources and fusing them. However, in practice, the data acquired from different sources are typically noisy. In some extreme cases, a noise of large magnitude can completely alter the semantics of the data leading to inconsistencies in the parallel multimodal data. In this paper, we propose a novel method for multimodal representation learning in a noisy environment via the generalized product of experts technique. In the proposed method, we train a separate network for each modality to assess the credibility of information coming from that modality, and subsequently, the contribution from each modality is dynamically varied while estimating the joint distribution. We evaluate our method on two challenging benchmarks from two diverse domains: multimodal 3D hand-pose estimation and multimodal surgical video segmentation. We attain state-of-the-art performance on both benchmarks. Our extensive quantitative and qualitative evaluations show the advantages of our method compared to previous approaches.
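The fusion step described in the abstract can be illustrated with a minimal sketch. It assumes each modality's expert is a one-dimensional Gaussian, which the abstract does not state, and it takes the per-modality credibility weights as plain numbers, whereas the paper obtains them from a separate network trained for each modality. The function name `gpoe_fuse` and its parameters are hypothetical and chosen only for illustration. Under these assumptions, raising each Gaussian expert to the power of its weight and renormalizing yields a Gaussian whose precision is the weighted sum of the expert precisions.

```python
from typing import Sequence, Tuple


def gpoe_fuse(
    means: Sequence[float],
    variances: Sequence[float],
    weights: Sequence[float],
) -> Tuple[float, float]:
    """Fuse 1-D Gaussian experts with a generalized product of experts.

    Expert i is N(means[i], variances[i]) raised to the power weights[i].
    The renormalized product is again Gaussian. The weights stand in for
    credibility scores (an assumption: here they are supplied by hand).
    """
    if not (len(means) == len(variances) == len(weights)):
        raise ValueError("means, variances and weights must have equal length")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")

    # Weighted precision of each expert: alpha_i / sigma_i^2.
    weighted_precisions = [w / v for w, v in zip(weights, variances)]
    total_precision = sum(weighted_precisions)
    if total_precision == 0:
        raise ValueError("at least one expert needs a positive weight")

    # Fused variance is the inverse of the summed weighted precisions.
    fused_variance = 1.0 / total_precision
    # Fused mean is the precision-weighted average of the expert means.
    fused_mean = fused_variance * sum(
        p * m for p, m in zip(weighted_precisions, means)
    )
    return fused_mean, fused_variance


if __name__ == "__main__":
    # Two modalities that disagree; the second is treated as less credible.
    mean, variance = gpoe_fuse([0.0, 10.0], [1.0, 1.0], [0.9, 0.1])
    print(mean, variance)
```

With all weights set to 1 this reduces to the ordinary product of experts. Lowering the weight of a modality pulls the fused estimate toward the remaining modalities, which is the sense in which each modality's contribution can be varied.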

arXiv information

Authors	Abhinav Joshi, Naman Gupta, Jinang Shah, Binod Bhattarai, Ashutosh Modi, Danail Stoyanov
Published	2022-11-07 14:27:38+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
