Improving Multimodal Learning with Multi-Loss Gradient Modulation

Summary

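Learning from multiple modalities, such as audio and video, makes it possible to leverage complementary information, increase robustness, and improve contextual understanding and performance.
Combining modalities is challenging, however, especially when they differ in data structure, predictive contribution, and the complexity of their learning processes.
One modality has been observed to dominate the learning process, which hinders effective use of the information in the other modalities and leads to sub-optimal model performance.
To address this, most previous work assesses the unimodal contributions and dynamically adjusts training to equalize them.
The authors improve on that work by introducing a multi-loss objective and refining the balancing process, so that the learning pace of each modality can be adjusted dynamically in both directions (acceleration and deceleration) and the balancing effect can be phased out upon convergence.
They report superior results across three audio-video datasets: on CREMA-D, models with ResNet backbone encoders surpass the previous best by 1.9% to 12.4%, and Conformer backbone models improve by 2.8% to 14.1% across different fusion methods.
On AVE, improvements range from 2.7% to 7.7%, and on UCF101 gains reach up to 6.1%.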
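
The balancing idea in the summary can be illustrated with a small sketch. This is not the authors' formulation: the abstract gives no equations, so the tanh-based rule, the `alpha` sharpness parameter, and the function names `modulation_coefficients`, `multi_loss`, and `modulate` are all assumptions made for illustration. The sketch only reproduces the three properties the abstract names: a combined loss with one term per modality, gradient scaling that can both accelerate and decelerate a modality, and a balancing effect that vanishes once the modalities perform equally.

```python
import math


def modulation_coefficients(scores, alpha=1.0):
    """Map per-modality performance scores to gradient scale factors.

    Hypothetical rule (an assumption, not taken from the paper):
    coeff = 1 + tanh(alpha * (mean score of the other modalities - own score)).
    A weaker modality gets coeff > 1 (acceleration), a stronger one gets
    coeff < 1 (deceleration), and equal scores give coeff == 1 (no balancing).
    """
    coeffs = {}
    for name, score in scores.items():
        others = [s for n, s in scores.items() if n != name]
        gap = sum(others) / len(others) - score
        coeffs[name] = 1.0 + math.tanh(alpha * gap)
    return coeffs


def multi_loss(fused_loss, unimodal_losses):
    """Sum the fused (multimodal) loss and one loss per unimodal head."""
    return fused_loss + sum(unimodal_losses.values())


def modulate(grads, coeffs):
    """Scale each modality encoder's gradient by its coefficient."""
    return {
        name: [coeffs[name] * g for g in grad]
        for name, grad in grads.items()
    }


# Illustrative values only: audio currently outperforms video.
scores = {"audio": 0.8, "video": 0.4}
coeffs = modulation_coefficients(scores)
total = multi_loss(1.2, {"audio": 0.5, "video": 1.1})
scaled = modulate({"audio": [1.0, -2.0], "video": [0.5, 0.5]}, coeffs)
```

With these example scores the audio coefficient falls below 1 and the video coefficient rises above 1, so the dominant modality is slowed while the weaker one is sped up. In a real training loop the scaling would be applied to the encoder gradients between the backward pass and the optimizer step.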

Summary (original)

Learning from multiple modalities, such as audio and video, offers opportunities for leveraging complementary information, enhancing robustness, and improving contextual understanding and performance. However, combining such modalities presents challenges, especially when modalities differ in data structure, predictive contribution, and the complexity of their learning processes. It has been observed that one modality can potentially dominate the learning process, hindering the effective utilization of information from other modalities and leading to sub-optimal model performance. To address this issue the vast majority of previous works suggest to assess the unimodal contributions and dynamically adjust the training to equalize them. We improve upon previous work by introducing a multi-loss objective and further refining the balancing process, allowing it to dynamically adjust the learning pace of each modality in both directions, acceleration and deceleration, with the ability to phase out balancing effects upon convergence. We achieve superior results across three audio-video datasets: on CREMA-D, models with ResNet backbone encoders surpass the previous best by 1.9% to 12.4%, and Conformer backbone models deliver improvements ranging from 2.8% to 14.1% across different fusion methods. On AVE, improvements range from 2.7% to 7.7%, while on UCF101, gains reach up to 6.1%.

arXiv information

Authors	Konstantinos Kontras, Christos Chatzichristos, Matthew Blaschko, Maarten De Vos
Published	2024-05-13 17:01:28+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
