OW-VISCapTor: Abstractors for Open-World Video Instance Segmentation and Captioning

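Summary

We propose a new task, "open-world video instance segmentation and captioning."
It requires detecting, segmenting, and tracking objects that have never been seen before, and describing them with rich captions.
This challenging task can be addressed by developing "abstractors" that connect a vision model to a language foundation model.
Concretely, we connect a multi-scale visual feature extractor and a large language model (LLM) by developing an object abstractor and an object-to-text abstractor.
The object abstractor, consisting of a prompt encoder and transformer blocks, introduces spatially diverse open-world object queries to discover never-before-seen objects in videos.
An inter-query contrastive loss further encourages the diversity of the object queries.
The object-to-text abstractor is augmented with masked cross-attention and acts as a bridge between the object queries and a frozen LLM, generating rich, descriptive, object-centric captions for each detected object.
Our generalized approach surpasses the baseline that jointly addresses open-world video instance segmentation and dense video object captioning by 13% on never-before-seen objects and by 10% on object-centric captions.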
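
The summary says that an inter-query contrastive loss encourages diversity among object queries, but it does not give the formula. The following is a minimal sketch of one possible form, not the authors' loss: it assumes each query is a plain list of floats and penalizes high cosine similarity between distinct queries through a temperature-scaled log-sum-exp. The function names and the `temperature` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def inter_query_repulsion(queries, temperature=0.1):
    """Illustrative diversity term (assumed form, not taken from the paper).

    For each query, take the log-sum-exp of its temperature-scaled cosine
    similarities to all other queries, then average over queries. The value
    is lower when the queries point in different directions.
    """
    n = len(queries)
    total = 0.0
    for i in range(n):
        sims = [
            cosine(queries[i], queries[j]) / temperature
            for j in range(n)
            if j != i
        ]
        top = max(sims)  # subtract the max for numerical stability
        total += top + math.log(sum(math.exp(s - top) for s in sims))
    return total / n


# Three mutually orthogonal queries versus three identical queries.
diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
collapsed = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
print(inter_query_repulsion(diverse) < inter_query_repulsion(collapsed))
```

Minimizing a term of this kind pushes queries apart in embedding space, which is the effect the summary attributes to the contrastive loss.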

Summary (original)

We propose the new task ‘open-world video instance segmentation and captioning’. It requires to detect, segment, track and describe with rich captions never before seen objects. This challenging task can be addressed by developing ‘abstractors’ which connect a vision model and a language foundation model. Concretely, we connect a multi-scale visual feature extractor and a large language model (LLM) by developing an object abstractor and an object-to-text abstractor. The object abstractor, consisting of a prompt encoder and transformer blocks, introduces spatially-diverse open-world object queries to discover never before seen objects in videos. An inter-query contrastive loss further encourages the diversity of object queries. The object-to-text abstractor is augmented with masked cross-attention and acts as a bridge between the object queries and a frozen LLM to generate rich and descriptive object-centric captions for each detected object. Our generalized approach surpasses the baseline that jointly addresses the tasks of open-world video instance segmentation and dense video object captioning by 13% on never before seen objects, and by 10% on object-centric captions.
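
The abstract states that the object-to-text abstractor is augmented with masked cross-attention. As a rough illustration of that general technique (not the authors' implementation), the sketch below lets a single query attend only to feature locations where a boolean mask is true. It assumes one query vector, per-location key and value vectors as plain lists, and a mask marking the locations that belong to one object; all names are hypothetical.

```python
import math


def masked_cross_attention(query, keys, values, mask):
    """Single-query scaled dot-product attention restricted by a mask.

    query:  list of d floats
    keys:   list of N key vectors, each of length d
    values: list of N value vectors
    mask:   list of N booleans; False locations receive zero weight
    Returns (output_vector, attention_weights).
    """
    if not any(mask):
        raise ValueError("mask must keep at least one location")
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Softmax over kept locations only; masked locations get weight 0.
    top = max(s for s, keep in zip(scores, mask) if keep)
    weights = [math.exp(s - top) if keep else 0.0 for s, keep in zip(scores, mask)]
    norm = sum(weights)
    weights = [w / norm for w in weights]
    dim = len(values[0])
    output = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim)]
    return output, weights


# The third location has the largest raw score but lies outside the mask.
out, attn = masked_cross_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0], [5.0, 0.0]],
    values=[[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]],
    mask=[True, True, False],
)
print(attn[2])  # weight of the masked-out location
```

Restricting attention in this way keeps each query focused on the features of its own object, so features from other regions do not leak into that object's representation.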

arXiv information

Authors	Anwesa Choudhuri, Girish Chowdhary, Alexander G. Schwing
Published	2024-12-09 18:19:05+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
