Open-Vocabulary Audio-Visual Semantic Segmentation

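Summary

Audio-visual semantic segmentation (AVSS) aims to segment and classify sounding objects in videos using acoustic cues.
However, most approaches operate under a closed-set assumption: they identify only the categories predefined in the training data and lack the generalization ability needed to detect novel categories in practical applications.
In this paper, we introduce a new task, open-vocabulary audio-visual semantic segmentation, which extends AVSS to open-world scenarios beyond the annotated label space.
This is a more challenging task because it requires recognizing all categories, including those never seen or heard during training.
We also propose OV-AVSS, the first open-vocabulary AVSS framework, which consists mainly of two parts: 1) a universal sound source localization module that performs audio-visual fusion and locates all potential sounding objects, and 2) an open-vocabulary classification module that predicts categories using prior knowledge from large-scale pre-trained vision-language models.
To properly evaluate open-vocabulary AVSS, we split the AVSBench-semantic benchmark into zero-shot training and testing subsets, which we call AVSBench-OV.
Extensive experiments demonstrate the model's strong segmentation and zero-shot generalization ability across all categories.
On the AVSBench-OV dataset, OV-AVSS achieves 55.43% mIoU on base categories and 29.14% mIoU on novel categories, exceeding the state-of-the-art zero-shot method by 41.88%/20.61% and the state-of-the-art open-vocabulary method by 10.2%/11.6%.
The code is available at https://github.com/ruohaoguo/ovavss.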
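
The base/novel mIoU figures above are averages of per-category intersection-over-union taken separately over the two category groups. The following is a minimal sketch of that computation on flattened label maps. The function names, the toy labels, and the choice to skip categories absent from both prediction and ground truth are assumptions for illustration, not the evaluation protocol of the OV-AVSS code.

```python
from typing import Dict, Iterable, Sequence


def per_class_iou(pred: Sequence[int], gt: Sequence[int],
                  class_ids: Iterable[int]) -> Dict[int, float]:
    """Compute IoU for each class over flattened per-pixel label maps."""
    ious: Dict[int, float] = {}
    for c in class_ids:
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:  # assumption: skip classes absent from both maps
            ious[c] = inter / union
    return ious


def mean_iou(ious: Dict[int, float], class_ids: Iterable[int]) -> float:
    """Average the IoU over the classes of one group (e.g. base or novel)."""
    vals = [ious[c] for c in class_ids if c in ious]
    return sum(vals) / len(vals) if vals else float("nan")


# Hypothetical toy data: 0 = background, 1 = a base class, 2 = a novel class.
pred = [0, 1, 1, 2, 2, 2]
gt = [0, 1, 2, 2, 2, 1]
base_classes = [1]
novel_classes = [2]

ious = per_class_iou(pred, gt, base_classes + novel_classes)
base_miou = mean_iou(ious, base_classes)
novel_miou = mean_iou(ious, novel_classes)
print(f"base mIoU = {base_miou:.3f}, novel mIoU = {novel_miou:.3f}")
```

Reporting the two groups separately shows how much accuracy is lost on categories that have no training annotations, which a single pooled average would hide.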

Summary (original)

Audio-visual semantic segmentation (AVSS) aims to segment and classify sounding objects in videos with acoustic cues. However, most approaches operate on the close-set assumption and only identify pre-defined categories from training data, lacking the generalization ability to detect novel categories in practical applications. In this paper, we introduce a new task: open-vocabulary audio-visual semantic segmentation, extending AVSS task to open-world scenarios beyond the annotated label space. This is a more challenging task that requires recognizing all categories, even those that have never been seen nor heard during training. Moreover, we propose the first open-vocabulary AVSS framework, OV-AVSS, which mainly consists of two parts: 1) a universal sound source localization module to perform audio-visual fusion and locate all potential sounding objects and 2) an open-vocabulary classification module to predict categories with the help of the prior knowledge from large-scale pre-trained vision-language models. To properly evaluate the open-vocabulary AVSS, we split zero-shot training and testing subsets based on the AVSBench-semantic benchmark, namely AVSBench-OV. Extensive experiments demonstrate the strong segmentation and zero-shot generalization ability of our model on all categories. On the AVSBench-OV dataset, OV-AVSS achieves 55.43% mIoU on base categories and 29.14% mIoU on novel categories, exceeding the state-of-the-art zero-shot method by 41.88%/20.61% and open-vocabulary method by 10.2%/11.6%. The code is available at https://github.com/ruohaoguo/ovavss.
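
The open-vocabulary classification step described in the abstract can be illustrated with a common pattern: compare an embedding of a localized region against text embeddings of candidate category names and pick the most similar one. The sketch below assumes this cosine-similarity formulation; the toy vectors, the `temperature` value, and the function names are hypothetical stand-ins, and the abstract does not confirm that OV-AVSS classifies in exactly this way.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_region(region_emb: List[float],
                    text_embs: Dict[str, List[float]],
                    temperature: float = 0.07) -> Tuple[str, Dict[str, float]]:
    """Score a region embedding against every category-name embedding.

    Returns the best-matching category and a softmax distribution over all
    candidate names. New names can be added at test time without retraining.
    """
    names = list(text_embs)
    logits = [cosine(region_emb, text_embs[n]) / temperature for n in names]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    probs = {n: e / total for n, e in zip(names, exps)}
    best = max(probs, key=probs.get)
    return best, probs


# Hypothetical toy embeddings; in practice these would come from the text
# and image encoders of a pre-trained vision-language model.
text_embs = {
    "dog": [1.0, 0.0, 0.0],
    "guitar": [0.0, 1.0, 0.0],
    "helicopter": [0.0, 0.0, 1.0],
}
region_emb = [0.1, 0.9, 0.2]

label, probs = classify_region(region_emb, text_embs)
print(label, {name: round(p, 3) for name, p in probs.items()})
```

Because the category set is just a dictionary of name embeddings, a novel category is handled by adding one more entry, which is what distinguishes this style of classifier from a fixed closed-set output layer.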

arXiv information

Authors	Ruohao Guo,Liao Qu,Dantong Niu,Yanyu Qi,Wenzhen Yue,Ji Shi,Bowei Xing,Xianghua Ying
Published	2024-07-31 16:14:09+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
