TIM: A Time Interval Machine for Audio-Visual Action Recognition

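Summary

Diverse actions give rise to rich audio-visual signals in long videos.
Recent work shows that the audio and video modalities exhibit different temporal extents of events and distinct labels.
We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events.
We propose the Time Interval Machine (TIM), in which a modality-specific time interval is posed as a query to a transformer encoder that ingests a long video input.
The encoder then attends to the specified interval, as well as the surrounding context in both modalities, to recognise the ongoing action.
We test TIM on three long audio-visual video datasets, EPIC-KITCHENS, Perception Test, and AVE, and report state-of-the-art (SOTA) recognition results.
On EPIC-KITCHENS, we beat the previous SOTA, which uses LLMs and significantly larger pre-training, by 2.9% top-1 action recognition accuracy.
Additionally, we show that TIM can be adapted for action detection using dense multi-scale interval queries, outperforming SOTA on EPIC-KITCHENS-100 for most metrics and showing strong performance on the Perception Test.
Our ablations show the critical role of integrating the two modalities and modelling their time intervals in achieving this performance.
Code and models: https://github.com/JacobChalk/TIM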

Summary (original)

Diverse actions give rise to rich audio-visual signals in long videos. Recent works showcase that the two modalities of audio and video exhibit different temporal extents of events and distinct labels. We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events. We propose the Time Interval Machine (TIM) where a modality-specific time interval poses as a query to a transformer encoder that ingests a long video input. The encoder then attends to the specified interval, as well as the surrounding context in both modalities, in order to recognise the ongoing action. We test TIM on three long audio-visual video datasets: EPIC-KITCHENS, Perception Test, and AVE, reporting state-of-the-art (SOTA) for recognition. On EPIC-KITCHENS, we beat previous SOTA that utilises LLMs and significantly larger pre-training by 2.9% top-1 action recognition accuracy. Additionally, we show that TIM can be adapted for action detection, using dense multi-scale interval queries, outperforming SOTA on EPIC-KITCHENS-100 for most metrics, and showing strong performance on the Perception Test. Our ablations show the critical role of integrating the two modalities and modelling their time intervals in achieving this performance. Code and models at: https://github.com/JacobChalk/TIM
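The abstract mentions "dense multi-scale interval queries" for action detection without giving details. As a rough illustration only, the sketch below enumerates overlapping query intervals at several lengths over one input window. The function names, the scale values, and the 50% overlap are all assumptions made for this example; they are not taken from the paper or from the TIM repository.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def dense_multiscale_intervals(
    window_seconds: float,
    scales: List[float],
    overlap: float = 0.5,
) -> List[Interval]:
    """Enumerate query intervals covering a window at several scales.

    Hypothetical helper: scales and overlap are illustrative assumptions,
    not values reported by the authors.
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    intervals: List[Interval] = []
    for length in scales:
        if length <= 0 or length > window_seconds:
            continue  # skip scales that cannot fit inside the window
        step = length * (1.0 - overlap)
        # Number of start positions with start + length <= window_seconds.
        count = int((window_seconds - length) / step + 1e-9) + 1
        for i in range(count):
            start = i * step
            intervals.append((round(start, 6), round(start + length, 6)))
    return intervals


def normalise(interval: Interval, window_seconds: float) -> Interval:
    """Map an interval in seconds to [0, 1] relative to the input window."""
    start, end = interval
    return (start / window_seconds, end / window_seconds)


# Example: a 4-second window queried at 1 s, 2 s, and 4 s scales.
queries = dense_multiscale_intervals(window_seconds=4.0, scales=[1.0, 2.0, 4.0])
```

In a detection setting, each such interval would be posed as a separate query and the per-query predictions merged afterwards; how TIM encodes the intervals and merges the outputs is not described in the abstract.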

arXiv information

Authors	Jacob Chalk, Jaesung Huh, Evangelos Kazakos, Andrew Zisserman, Dima Damen
Published	2024-04-08 14:30:42+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
