Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation

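Abstract

Pre-trained vision-language models such as CLIP are increasingly used for the challenging Open-Vocabulary Segmentation (OVS) task because their vision and text embedding spaces are well aligned. Typical solutions either freeze CLIP during training, which unilaterally preserves its zero-shot capability, or fine-tune CLIP's vision encoder to gain perceptual sensitivity to local regions. Few, however, optimize vision and text collaboratively. To address this, we propose Content-Dependent Transfer, which adaptively enhances each text embedding by interacting with the input image and offers a parameter-efficient way to optimize the text representation. We also introduce a Representation Compensation strategy that revisits the original CLIP-V representation as compensation to maintain CLIP's zero-shot capability. In this way, CLIP's vision and text representations are optimized collaboratively, strengthening the alignment of the vision-text feature space. To the best of our knowledge, we are the first to establish a collaborative vision-text optimization mechanism in the OVS field. Extensive experiments show that our method achieves superior performance on popular OVS benchmarks. In open-vocabulary semantic segmentation, it outperforms previous state-of-the-art approaches by +0.5, +2.3, +3.4, +0.4, and +1.1 mIoU on A-847, A-150, PC-459, PC-59, and PAS-20, respectively. In the panoptic setting on ADE20K, it achieves 27.1 PQ, 73.5 SQ, and 32.9 RQ. Code is available at https://github.com/jiaosiyu1999/MAFT-Plus.git.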
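The abstract describes enhancing each text embedding by letting it interact with the input image. One plausible way to picture this is a single cross-attention step in which the class-name embeddings act as queries over the image's visual tokens. The sketch below is an illustration under that assumption, not the authors' implementation: the function name `content_dependent_text`, the projection matrices `w_q`, `w_k`, `w_v`, the residual connection, and the toy sizes are all hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def content_dependent_text(text_emb, image_feats, w_q, w_k, w_v):
    """Hypothetical sketch: refine text embeddings with one cross-attention step.

    text_emb:    K x D class-name embeddings (used as queries)
    image_feats: N x D visual tokens of the input image (keys and values)
    Returns the K x D enhanced embeddings and the K x N attention weights.
    """
    d = len(text_emb[0])
    q = matmul(text_emb, w_q)
    k = matmul(image_feats, w_k)
    v = matmul(image_feats, w_v)
    scores = matmul(q, transpose(k))  # K x N similarity of each class to each token
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    update = matmul(attn, v)  # K x D image-conditioned update
    # Assumed residual connection: the original text embedding stays the base.
    enhanced = [
        [t + u for t, u in zip(t_row, u_row)]
        for t_row, u_row in zip(text_emb, update)
    ]
    return enhanced, attn


def rand_matrix(rows, cols):
    """Random Gaussian matrix standing in for real CLIP features or weights."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
K, N, D = 3, 5, 4  # toy sizes: 3 class names, 5 visual tokens, 4 channels
text_emb = rand_matrix(K, D)
image_feats = rand_matrix(N, D)
w_q, w_k, w_v = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)

enhanced, attn = content_dependent_text(text_emb, image_feats, w_q, w_k, w_v)
print(len(enhanced), len(enhanced[0]))  # one enhanced embedding per class name
```

Because the update is added on top of the original embedding, the same class name can end up with a different representation for each image, while only the small projection matrices would need training in this simplified picture.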

Abstract (original)

Pre-trained vision-language models, e.g. CLIP, have been increasingly used to address the challenging Open-Vocabulary Segmentation (OVS) task, benefiting from their well-aligned vision-text embedding space. Typical solutions involve either freezing CLIP during training to unilaterally maintain its zero-shot capability, or fine-tuning CLIP vision encoder to achieve perceptual sensitivity to local regions. However, few of them incorporate vision-text collaborative optimization. Based on this, we propose the Content-Dependent Transfer to adaptively enhance each text embedding by interacting with the input image, which presents a parameter-efficient way to optimize the text representation. Besides, we additionally introduce a Representation Compensation strategy, reviewing the original CLIP-V representation as compensation to maintain the zero-shot capability of CLIP. In this way, the vision and text representation of CLIP are optimized collaboratively, enhancing the alignment of the vision-text feature space. To the best of our knowledge, we are the first to establish the collaborative vision-text optimizing mechanism within the OVS field. Extensive experiments demonstrate our method achieves superior performance on popular OVS benchmarks. In open-vocabulary semantic segmentation, our method outperforms the previous state-of-the-art approaches by +0.5, +2.3, +3.4, +0.4 and +1.1 mIoU, respectively on A-847, A-150, PC-459, PC-59 and PAS-20. Furthermore, in a panoptic setting on ADE20K, we achieve the performance of 27.1 PQ, 73.5 SQ, and 32.9 RQ. Code will be available at https://github.com/jiaosiyu1999/MAFT-Plus.git .

arXiv information

Authors	Siyu Jiao, Hongguang Zhu, Jiannan Huang, Yao Zhao, Yunchao Wei, Humphrey Shi
Published	2024-08-01 17:48:08+00:00

Source and services used

arxiv.jp, DeepL
