Robust Cross-Modal Knowledge Distillation for Unconstrained Videos

Summary

[Title] Robust Cross-Modal Knowledge Distillation for Unconstrained Videos

[Summary]

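– Cross-modal distillation is widely used to transfer knowledge across modalities, enriching the representation of the target unimodal model.
– Recent studies closely tie the temporal synchronization between vision and sound to semantic consistency in cross-modal distillation. In unconstrained videos, however, this semantic consistency is hard to guarantee because of irrelevant modality noise and differentiated semantic correlation.
– To address this, the paper proposes a Modality Noise Filter (MNF) module that erases irrelevant noise in the teacher modality using cross-modal context.
– It then designs a Contrastive Semantic Calibration (CSC) module that adaptively distills useful knowledge for the target modality by referring, in a contrastive fashion, to the differentiated sample-wise semantic correlation.
– Extensive experiments show that the method improves performance over other distillation methods on both visual action recognition and video retrieval. The method is also extended to audio tagging to demonstrate its generalization. The source code is available at https://github.com/GeWu-Lab/cross-modal-distillation.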
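
To make the idea of sample-wise adaptive distillation more concrete, here is a minimal toy sketch in plain Python. It is not the authors' MNF or CSC module: the function names, the use of cosine similarity as a stand-in for "semantic correlation", the softmax weighting, and the `temperature` parameter are all assumptions made for illustration only. It simply shows how a per-sample weight could scale a feature-matching loss so that better-correlated teacher/student pairs contribute more.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [x / temperature for x in xs]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_distillation_loss(teacher_feats, student_feats, temperature=0.5):
    """Toy sample-weighted feature distillation (illustrative assumption).

    Each sample's weight comes from the cosine similarity between its
    teacher-modality and student-modality features; the loss is the
    weighted sum of per-sample mean squared errors.
    """
    sims = [cosine(t, s) for t, s in zip(teacher_feats, student_feats)]
    weights = softmax(sims, temperature)
    per_sample_mse = [
        sum((a - b) ** 2 for a, b in zip(t, s)) / len(t)
        for t, s in zip(teacher_feats, student_feats)
    ]
    loss = sum(w * l for w, l in zip(weights, per_sample_mse))
    return loss, weights


# Hypothetical 2-D features: sample 0 is well aligned, sample 1 is not.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[1.0, 0.0], [1.0, 0.0]]
loss, weights = weighted_distillation_loss(teacher, student)
```

In this toy setup the poorly aligned second sample receives the smaller weight, so its mismatch is down-weighted rather than forced onto the student. The actual modules described in the paper are available in the linked repository.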

Summary (original)

Cross-modal distillation has been widely used to transfer knowledge across different modalities, enriching the representation of the target unimodal one. Recent studies highly relate the temporal synchronization between vision and sound to the semantic consistency for cross-modal distillation. However, such semantic consistency from the synchronization is hard to guarantee in unconstrained videos, due to the irrelevant modality noise and differentiated semantic correlation. To this end, we first propose a Modality Noise Filter (MNF) module to erase the irrelevant noise in teacher modality with cross-modal context. After this purification, we then design a Contrastive Semantic Calibration (CSC) module to adaptively distill useful knowledge for target modality, by referring to the differentiated sample-wise semantic correlation in a contrastive fashion. Extensive experiments show that our method could bring a performance boost compared with other distillation methods in both visual action recognition and video retrieval task. We also extend to the audio tagging task to prove the generalization of our method. The source code is available at https://github.com/GeWu-Lab/cross-modal-distillation.

arXiv information

Authors	Wenke Xia, Xingjian Li, Andong Deng, Haoyi Xiong, Dejing Dou, Di Hu
Published	2023-04-27 04:08:08+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, OpenAI
