Revisiting Pre-training in Audio-Visual Learning

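Abstract

Pre-training has been highly successful at improving model performance across many tasks, yet in some uni-modal settings it has been found to perform worse than training from scratch. This leads us to ask whether pre-trained models are always effective in more complex multi-modal scenarios, particularly for heterogeneous modalities such as audio and vision. We examine the effect of pre-trained models in two learning scenarios, cross-modal initialization and multi-modal joint learning, and find that the answer is "No". In cross-modal initialization, a "dead channel" phenomenon caused by abnormal Batchnorm parameters hinders the use of model capacity. We therefore propose Adaptive Batchnorm Re-initialization (ABRi) to better exploit the capacity of pre-trained models on target tasks. In multi-modal joint learning, we find that a strong pre-trained uni-modal encoder negatively affects the encoder of the other modality. To alleviate this problem, we introduce a two-stage Fusion Tuning strategy that makes better use of pre-trained knowledge while making the uni-modal encoders cooperate through an adaptive masking method. Experimental results show that our methods further unlock the potential of pre-trained models and improve performance in audio-visual learning.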

Abstract (original)

Pre-training technique has gained tremendous success in enhancing model performance on various tasks, but found to perform worse than training from scratch in some uni-modal situations. This inspires us to think: are the pre-trained models always effective in the more complex multi-modal scenario, especially for the heterogeneous modalities such as audio and visual ones? We find that the answer is No. Specifically, we explore the effects of pre-trained models on two audio-visual learning scenarios: cross-modal initialization and multi-modal joint learning. When cross-modal initialization is applied, the phenomena of ‘dead channel’ caused by abnormal Batchnorm parameters hinders the utilization of model capacity. Thus, we propose Adaptive Batchnorm Re-initialization (ABRi) to better exploit the capacity of pre-trained models for target tasks. In multi-modal joint learning, we find a strong pre-trained uni-modal encoder would bring negative effects on the encoder of another modality. To alleviate such problem, we introduce a two-stage Fusion Tuning strategy, taking better advantage of the pre-trained knowledge while making the uni-modal encoders cooperate with an adaptive masking method. The experiment results show that our methods could further exploit pre-trained models’ potential and boost performance in audio-visual learning.
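The abstract does not spell out how ABRi works, so the following is only a rough illustration of the "dead channel" idea, not the authors' method. It assumes that an "abnormal" Batchnorm parameter means a scale (gamma) close to zero: since a Batchnorm layer computes `gamma * x_hat + beta`, such a channel outputs a near-constant value regardless of its input. The function names, the threshold of `1e-3`, and the naive reset to `gamma = 1`, `beta = 0` are all hypothetical choices made for this sketch.

```python
from typing import List, Tuple


def find_dead_channels(gamma: List[float], threshold: float = 1e-3) -> List[int]:
    """Return indices of channels whose Batchnorm scale is close to zero.

    Assumption for this sketch: a channel counts as "dead" when |gamma| is
    below a hypothetical threshold, because its output is then almost
    independent of the input.
    """
    return [i for i, g in enumerate(gamma) if abs(g) < threshold]


def reinit_dead_channels(
    gamma: List[float], beta: List[float], threshold: float = 1e-3
) -> Tuple[List[float], List[float]]:
    """Naively reset dead channels to the default Batchnorm initialization.

    This is a plain (non-adaptive) reset to gamma = 1, beta = 0, used only to
    illustrate the idea; it is not the adaptive scheme proposed in the paper.
    """
    dead = set(find_dead_channels(gamma, threshold))
    new_gamma = [1.0 if i in dead else g for i, g in enumerate(gamma)]
    new_beta = [0.0 if i in dead else b for i, b in enumerate(beta)]
    return new_gamma, new_beta


# Toy per-channel parameters for a single Batchnorm layer (made-up values).
gamma = [0.9, 0.0002, 1.1, -0.0005]
beta = [0.1, -2.0, 0.0, -1.5]

dead = find_dead_channels(gamma)
new_gamma, new_beta = reinit_dead_channels(gamma, beta)
print("dead channels:", dead)
print("gamma after reset:", new_gamma)
print("beta after reset:", new_beta)
```

In a real model the same check would be run over each Batchnorm layer of the pre-trained encoder before fine-tuning on the target modality; how the re-initialization should adapt to the target task is exactly what the paper addresses and is not reproduced here.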

arXiv information

Authors	Ruoxuan Feng, Wenke Xia, Di Hu
Published	2023-02-07 15:34:14+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
