ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations

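Abstract

Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description.
Despite notable progress in recent years, current RVOS models still struggle with complicated object descriptions because their video-language understanding is limited.
To address this limitation, we present ReferDINO, an end-to-end RVOS model that inherits strong vision-language understanding from pretrained visual grounding foundation models and is further endowed with effective temporal understanding and object segmentation capabilities.
In ReferDINO, we contribute three technical innovations for effectively adapting the foundation models to RVOS: 1) an object-consistent temporal enhancer that capitalizes on the pretrained object-text representations to enhance temporal understanding and object consistency;
2) a grounding-guided deformable mask decoder that integrates text and grounding conditions to generate accurate object masks;
3) a confidence-aware query pruning strategy that significantly improves object decoding efficiency without compromising performance.
We conduct extensive experiments on five public RVOS benchmarks to demonstrate that the proposed ReferDINO significantly outperforms state-of-the-art methods.
Project page: https://isee-laboratory.github.io/ReferDINO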

Abstract (original)

Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description. Despite notable progress in recent years, current RVOS models still struggle to handle complicated object descriptions due to their limited video-language understanding. To address this limitation, we present \textbf{ReferDINO}, an end-to-end RVOS model that inherits strong vision-language understanding from the pretrained visual grounding foundation models, and is further endowed with effective temporal understanding and object segmentation capabilities. In ReferDINO, we contribute three technical innovations for effectively adapting the foundation models to RVOS: 1) an object-consistent temporal enhancer that capitalizes on the pretrained object-text representations to enhance temporal understanding and object consistency; 2) a grounding-guided deformable mask decoder that integrates text and grounding conditions to generate accurate object masks; 3) a confidence-aware query pruning strategy that significantly improves the object decoding efficiency without compromising performance. We conduct extensive experiments on five public RVOS benchmarks to demonstrate that our proposed ReferDINO outperforms state-of-the-art methods significantly. Project page: \url{https://isee-laboratory.github.io/ReferDINO}

arXiv information

Authors	Tianming Liang, Kun-Yu Lin, Chaolei Tan, Jianguo Zhang, Wei-Shi Zheng, Jian-Fang Hu
Published	2025-01-24 16:24:15+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
