DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception

Summary

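Dense visual prediction tasks have been constrained by their reliance on predefined categories, which limits their applicability in real-world scenarios where visual concepts are unbounded.
Vision-language models (VLMs) such as CLIP have shown promise in open-vocabulary tasks, but applying them directly to dense prediction often yields suboptimal performance because of limitations in their local feature representations.
In this work, we present the observation that CLIP's image tokens struggle to aggregate information effectively from spatially or semantically related regions, which results in features that lack local discriminability and spatial consistency.
To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features, respectively.
The "content" features are aligned with image crop representations to improve local discriminability, while the "context" features learn to retain spatial correlations under the guidance of vision foundation models such as DINO.
Extensive experiments show that DeCLIP significantly outperforms existing methods across multiple open-vocabulary dense prediction tasks, including object detection and semantic segmentation.
Code is available at https://github.com/xiaomoguhz/DeCLIP.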

Abstract (original)

Dense visual prediction tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense prediction often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP’s image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain “content” and “context” features respectively. The “content” features are aligned with image crop representations to improve local discriminability, while “context” features learn to retain the spatial correlations under the guidance of vision foundation models, such as DINO. Extensive experiments demonstrate that DeCLIP significantly outperforms existing methods across multiple open-vocabulary dense prediction tasks, including object detection and semantic segmentation. Code is available at https://github.com/xiaomoguhz/DeCLIP.

arXiv information

Authors	Junjie Wang, Bin Chen, Yulin Li, Bin Kang, Yichi Chen, Zhuotao Tian
Published	2025-05-07 13:46:34+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
