Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification

Summary

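Using both audio and visual modalities for video classification is challenging: existing methods rely on large model architectures, which bring high computational complexity and resource requirements.
Smaller architectures, on the other hand, struggle to reach optimal performance.
This paper proposes Attend-Fusion, an audio-visual (AV) fusion approach that introduces a compact model architecture designed specifically to capture the intricate audio-visual relationships in video data.
Through extensive experiments on the challenging YouTube-8M dataset, the authors show that Attend-Fusion achieves an F1 score of 75.64% with only 72M parameters, comparable to larger baseline models such as Fully-Connected Late Fusion (75.96% F1 score, 341M parameters).
Attend-Fusion matches the larger baseline's performance to within about 0.3 F1 points while reducing model size by nearly 80%, which highlights its efficiency in terms of model complexity.
The work demonstrates that the Attend-Fusion model effectively combines audio and visual information for video classification, achieving competitive performance at a significantly reduced model size.
This approach opens new possibilities for deploying high-performance video understanding systems in resource-constrained environments across a range of applications.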
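
The "nearly 80%" figure can be checked directly from the parameter counts and F1 scores quoted in the abstract. The snippet below is only an arithmetic sketch: the function and variable names are illustrative and do not come from the paper, and the parameter counts are the rounded values reported above.

```python
def size_reduction(small_params: float, large_params: float) -> float:
    """Return the fractional reduction in parameter count."""
    return 1.0 - small_params / large_params


# Rounded figures quoted in the abstract (names are illustrative).
attend_fusion_params = 72e6      # Attend-Fusion: 72M parameters
fc_late_fusion_params = 341e6    # Fully-Connected Late Fusion: 341M parameters

reduction = size_reduction(attend_fusion_params, fc_late_fusion_params)
f1_gap = 75.96 - 75.64           # baseline F1 minus Attend-Fusion F1, in points

print(f"Parameter reduction: {reduction:.1%}")
print(f"F1 gap: {f1_gap:.2f} points")
```

With these rounded inputs the reduction works out to roughly 79%, consistent with the abstract's "nearly 80%", at a cost of about a third of an F1 point.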

Abstract (original)

Exploiting both audio and visual modalities for video classification is a challenging task, as the existing methods require large model architectures, leading to high computational complexity and resource requirements. Smaller architectures, on the other hand, struggle to achieve optimal performance. In this paper, we propose Attend-Fusion, an audio-visual (AV) fusion approach that introduces a compact model architecture specifically designed to capture intricate audio-visual relationships in video data. Through extensive experiments on the challenging YouTube-8M dataset, we demonstrate that Attend-Fusion achieves an F1 score of 75.64% with only 72M parameters, which is comparable to the performance of larger baseline models such as Fully-Connected Late Fusion (75.96% F1 score, 341M parameters). Attend-Fusion achieves similar performance to the larger baseline model while reducing the model size by nearly 80%, highlighting its efficiency in terms of model complexity. Our work demonstrates that the Attend-Fusion model effectively combines audio and visual information for video classification, achieving competitive performance with significantly reduced model size. This approach opens new possibilities for deploying high-performance video understanding systems in resource-constrained environments across various applications.

arXiv information

Authors	Mahrukh Awan, Asmar Nadeem, Muhammad Junaid Awan, Armin Mustafa, Syed Sameed Husain
Published	2024-08-26 17:33:47+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
