UWAV: Uncertainty-weighted Weakly-supervised Audio-Visual Video Parsing

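Summary

Audio-visual video parsing (AVVP) is the challenging task of localizing both uni-modal events (those occurring in only the visual or only the acoustic modality of a video) and multi-modal events (those occurring in both modalities at the same time).
Moreover, annotating training data with the class labels of all these events, together with their start and end times, is prohibitively expensive, which limits the scalability of AVVP techniques unless they can be trained in a weakly-supervised setting where only modality-agnostic, video-level labels are available.
To this end, recently proposed approaches generate segment-level pseudo-labels to better guide model training.
However, these pseudo-labels are generated without modeling inter-segment dependencies, and there is a general bias toward predicting labels that are absent in a segment; both limit performance.
This work proposes a new approach to overcoming these weaknesses, called Uncertainty-weighted Weakly-supervised Audio-visual Video Parsing (UWAV).
In addition, the approach factors in the uncertainty associated with the estimated pseudo-labels and incorporates a feature-mixup-based training regularization for improved training.
Empirical results show that UWAV outperforms state-of-the-art methods for the AVVP task on multiple metrics across two different datasets, attesting to its effectiveness and generalizability.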

Abstract (original)

Audio-Visual Video Parsing (AVVP) entails the challenging task of localizing both uni-modal events (i.e., those occurring exclusively in either the visual or acoustic modality of a video) and multi-modal events (i.e., those occurring in both modalities concurrently). Moreover, the prohibitive cost of annotating training data with the class labels of all these events, along with their start and end times, imposes constraints on the scalability of AVVP techniques unless they can be trained in a weakly-supervised setting, where only modality-agnostic, video-level labels are available in the training data. To this end, recently proposed approaches seek to generate segment-level pseudo-labels to better guide model training. However, the absence of inter-segment dependencies when generating these pseudo-labels and the general bias towards predicting labels that are absent in a segment limit their performance. This work proposes a novel approach towards overcoming these weaknesses called Uncertainty-weighted Weakly-supervised Audio-visual Video Parsing (UWAV). Additionally, our innovative approach factors in the uncertainty associated with these estimated pseudo-labels and incorporates a feature mixup based training regularization for improved training. Empirical results show that UWAV outperforms state-of-the-art methods for the AVVP task on multiple metrics, across two different datasets, attesting to its effectiveness and generalizability.
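
The abstract names two ingredients, uncertainty weighting of pseudo-labels and feature mixup, without giving their formulas. The sketch below is a generic illustration of both ideas, not the UWAV implementation: the weighting rule in `confidence_weight`, the function names, and the toy numbers are all assumptions made for this example.

```python
import math


def confidence_weight(pseudo_prob):
    """Hypothetical weight: 0 for a maximally uncertain pseudo-label (0.5),
    rising linearly to 1 for a fully confident one (0 or 1)."""
    return abs(2.0 * pseudo_prob - 1.0)


def weighted_bce(probs, labels, weights, eps=1e-7):
    """Mean binary cross-entropy with each term scaled by its weight."""
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -w * (y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(probs)


def mixup(feat_a, feat_b, label_a, label_b, lam):
    """Standard mixup: convex combination of two feature vectors and
    of their (soft) label vectors, with mixing coefficient lam."""
    mixed_feat = [lam * a + (1.0 - lam) * b for a, b in zip(feat_a, feat_b)]
    mixed_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_feat, mixed_label


# Toy usage: three segments with pseudo-label probabilities for one class.
pseudo_probs = [0.95, 0.55, 0.10]
weights = [confidence_weight(q) for q in pseudo_probs]
hard_labels = [1.0 if q >= 0.5 else 0.0 for q in pseudo_probs]
model_probs = [0.8, 0.4, 0.2]
loss = weighted_bce(model_probs, hard_labels, weights)

# Mix the features and labels of two segments with lam = 0.25.
mixed_feat, mixed_label = mixup([1.0, 0.0], [0.0, 1.0], [1.0], [0.0], 0.25)
```

In this toy setup the ambiguous middle segment (pseudo-probability 0.55) contributes far less to the loss than the two confident ones, which is the intuition behind weighting pseudo-labels by their uncertainty.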

arXiv information

Authors	Yung-Hsuan Lai, Janek Ebbers, Yu-Chiang Frank Wang, François Germain, Michael Jeffrey Jones, Moitreya Chatterjee
Published	2025-05-14 17:59:55+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
