Multimodal Cross-Domain Few-Shot Learning for Egocentric Action Recognition

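Summary

We address a new cross-domain few-shot learning (CD-FSL) task for egocentric action recognition that uses multimodal input and unlabeled target data.
The paper tackles two key challenges of egocentric action recognition in the CD-FSL setting at once: (1) the extreme domain gap between egocentric videos (for example, daily life versus industrial domains) and (2) the computational cost of real-world applications.
We propose MM-CDFSL, a domain-adaptive and computationally efficient approach designed to improve adaptability to the target domain and reduce inference cost.
To address the first challenge, we distill multimodal knowledge from teacher models into a student RGB model.
Each teacher model is trained independently on source and target data for its own modality.
Using only unlabeled target data during multimodal distillation improves the student model's adaptability to the target domain.
We also introduce ensemble masked inference, a technique that reduces the number of input tokens through masking.
In this approach, ensemble prediction mitigates the performance degradation caused by masking, which addresses the second challenge.
On multiple egocentric datasets, our approach outperforms state-of-the-art CD-FSL approaches by a substantial margin, improving by an average of 6.12/6.10 points in the 1-shot/5-shot settings while achieving 2.2 times faster inference.
Project page: https://masashi-hatano.github.io/MM-CDFSL/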
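
The multimodal distillation step can be illustrated with a minimal sketch. The abstract does not specify the distillation loss, so this example assumes a simple feature-matching objective: the mean squared error between the student RGB feature and each teacher feature, averaged with equal weights. The function names and the modality key "second_modality" are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

Vector = List[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multimodal_distillation_loss(
    student_feat: Vector, teacher_feats: Dict[str, Vector]
) -> float:
    """Average the student-to-teacher MSE over all teacher modalities.

    ASSUMPTION: equal-weight MSE feature matching. The abstract only says
    that teacher models are distilled into a student RGB model using
    unlabeled target data; the actual loss may differ.
    """
    if not teacher_feats:
        raise ValueError("at least one teacher feature is required")
    total = sum(mse(student_feat, feat) for feat in teacher_feats.values())
    return total / len(teacher_feats)


# Toy features for one unlabeled target clip (no label is needed).
student = [1.0, 0.0]
teachers = {
    "rgb": [1.0, 0.0],              # teacher for the RGB modality
    "second_modality": [0.0, 0.0],  # hypothetical non-RGB teacher
}
loss = multimodal_distillation_loss(student, teachers)
```

Because the loss depends only on teacher and student features, it can be computed on target clips without labels, which is the property the summary relies on.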

Abstract (original)

We address a novel cross-domain few-shot learning task (CD-FSL) with multimodal input and unlabeled target data for egocentric action recognition. This paper simultaneously tackles two critical challenges associated with egocentric action recognition in CD-FSL settings: (1) the extreme domain gap in egocentric videos (e.g., daily life vs. industrial domain) and (2) the computational cost for real-world applications. We propose MM-CDFSL, a domain-adaptive and computationally efficient approach designed to enhance adaptability to the target domain and improve inference cost. To address the first challenge, we propose the incorporation of multimodal distillation into the student RGB model using teacher models. Each teacher model is trained independently on source and target data for its respective modality. Leveraging only unlabeled target data during multimodal distillation enhances the student model’s adaptability to the target domain. We further introduce ensemble masked inference, a technique that reduces the number of input tokens through masking. In this approach, ensemble prediction mitigates the performance degradation caused by masking, effectively addressing the second issue. Our approach outperformed the state-of-the-art CD-FSL approaches with a substantial margin on multiple egocentric datasets, improving by an average of 6.12/6.10 points for 1-shot/5-shot settings while achieving $2.2$ times faster inference speed. Project page: https://masashi-hatano.github.io/MM-CDFSL/
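
Ensemble masked inference, as described in the abstract, can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: it assumes uniformly random token masking, a fixed keep ratio, and averaging of softmax probabilities over the masked views. The names `ensemble_masked_inference`, `toy_model`, `keep_ratio`, and `num_views` are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_masked_inference(
    tokens: Sequence[float],
    model: Callable[[Sequence[float]], Sequence[float]],
    keep_ratio: float = 0.5,
    num_views: int = 2,
    seed: int = 0,
) -> List[float]:
    """Average class probabilities over several randomly masked inputs.

    ASSUMPTION: random masking and probability averaging; the abstract
    only states that masking reduces the token count and that ensemble
    prediction offsets the resulting accuracy loss.
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    rng = random.Random(seed)
    n_keep = max(1, int(len(tokens) * keep_ratio))
    summed = None
    for _ in range(num_views):
        # Each view keeps a different random subset of the tokens.
        kept = sorted(rng.sample(range(len(tokens)), n_keep))
        probs = softmax(model([tokens[i] for i in kept]))
        summed = probs if summed is None else [s + p for s, p in zip(summed, probs)]
    return [s / num_views for s in summed]


def toy_model(tokens: Sequence[float]) -> List[float]:
    """Hypothetical stand-in classifier producing two logits."""
    s = sum(tokens)
    return [s, -s]


probs = ensemble_masked_inference([0.2, 0.4, 0.1, 0.3], toy_model)
```

Each forward pass sees only a fraction of the tokens, so the per-pass cost drops, and averaging several such passes is the part intended to recover the accuracy lost to masking.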

arXiv information

Authors	Masashi Hatano, Ryo Hachiuma, Ryo Fujii, Hideo Saito
Published	2024-07-16 14:56:12+00:00

Source and services used

arxiv.jp, Google
