Learning Aligned Cross-modal Representations for Referring Image Segmentation

Abstract

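Referring image segmentation aims to segment the image region of interest described by a given language expression, making it a typical multi-modal task.
One of the key challenges of this task is aligning the semantic representations of the different modalities, namely vision and language.
To do so, previous methods perform cross-modal interaction to update the visual features, but they overlook the role of integrating fine-grained visual features into the linguistic features.
This paper presents AlignFormer, an end-to-end framework for referring image segmentation.
AlignFormer treats the linguistic feature as a center embedding and segments the region of interest by grouping pixels around that center embedding.
To achieve pixel-text alignment, the authors design a Vision-Language Bidirectional Attention module (VLBA) and rely on contrastive learning.
Concretely, the VLBA enhances the visual features by propagating semantic text representations to each pixel, and it improves the linguistic features by fusing in fine-grained image features.
In addition, a cross-modal instance contrastive loss is introduced to reduce the influence of pixel samples in ambiguous regions and to improve the model's ability to align multi-modal representations.
Extensive experiments show that AlignFormer achieves new state-of-the-art performance on RefCOCO, RefCOCO+, and RefCOCOg by large margins.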
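
The bidirectional attention and pixel grouping described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no architectural details, so the version below assumes plain scaled dot-product attention with a residual connection and no learned projections, takes the center embedding to be the mean of the updated word features, and uses a hypothetical similarity threshold of 0. All function names are made up for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, sources):
    """Update each query with an attention-weighted average of `sources`
    (scaled dot-product attention plus a residual connection)."""
    scale = math.sqrt(len(queries[0]))
    updated = []
    for q in queries:
        weights = softmax([dot(q, s) / scale for s in sources])
        mixed = [sum(w * s[d] for w, s in zip(weights, sources))
                 for d in range(len(q))]
        updated.append([qi + mi for qi, mi in zip(q, mixed)])
    return updated

def bidirectional_attention(pixels, words):
    """Assumed simplification of a bidirectional vision-language update."""
    new_pixels = attend(pixels, words)  # language -> vision
    new_words = attend(words, pixels)   # vision -> language
    return new_pixels, new_words

def group_pixels(pixels, center, threshold=0.0):
    """Mark a pixel as foreground when its similarity to the center
    embedding exceeds the (hypothetical) threshold."""
    return [1 if dot(p, center) > threshold else 0 for p in pixels]

# Toy data: three 2-d pixel features and two 2-d word features.
pixels = [[1.0, 0.0], [0.9, 0.1], [-2.0, 0.0]]
words = [[1.0, 0.0], [0.8, 0.2]]

new_pixels, new_words = bidirectional_attention(pixels, words)
# Assumption: the center embedding is the mean of the updated word features.
center = [sum(w[d] for w in new_words) / len(new_words) for d in range(2)]
mask = group_pixels(new_pixels, center)
print(mask)
```

In this toy setup the first two pixels point in the same direction as the text features and end up in the foreground group, while the third does not. A real model would use learned query, key, and value projections and multi-scale features.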

Abstract (original)

Referring image segmentation aims to segment the image region of interest according to the given language expression, which is a typical multi-modal task. One of the critical challenges of this task is to align semantic representations for different modalities including vision and language. To achieve this, previous methods perform cross-modal interactions to update visual features but ignore the role of integrating fine-grained visual features into linguistic features. We present AlignFormer, an end-to-end framework for referring image segmentation. Our AlignFormer views the linguistic feature as the center embedding and segments the region of interest by pixel grouping based on the center embedding. To achieve the pixel-text alignment, we design a Vision-Language Bidirectional Attention module (VLBA) and resort to contrastive learning. Concretely, the VLBA enhances visual features by propagating semantic text representations to each pixel and promotes linguistic features by fusing fine-grained image features. Moreover, we introduce the cross-modal instance contrastive loss to alleviate the influence of pixel samples in ambiguous regions and improve the ability to align multi-modal representations. Extensive experiments demonstrate that our AlignFormer achieves a new state-of-the-art performance on RefCOCO, RefCOCO+, and RefCOCOg by large margins.
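
The idea of a pixel-text contrastive loss that discounts ambiguous pixels can be sketched as follows. The abstract does not give the formula for the cross-modal instance contrastive loss, so this is a generic InfoNCE-style stand-in rather than the paper's loss. It assumes foreground pixels are positives, background pixels are negatives, pixels flagged as ambiguous are simply skipped, and the temperature is 0.1. The function name, the `ambiguous` flags, and the temperature are all hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def logsumexp(xs):
    m = max(xs)  # numerically stable log(sum(exp(x)))
    return m + math.log(sum(math.exp(x - m) for x in xs))

def pixel_text_contrastive_loss(pixels, center, labels, ambiguous,
                                temperature=0.1):
    """InfoNCE-style sketch: pull foreground pixels toward the text
    center embedding and push background pixels away.
    Pixels flagged as ambiguous are excluded (an assumption)."""
    pos, neg = [], []
    for p, y, skip in zip(pixels, labels, ambiguous):
        if skip:
            continue
        score = dot(p, center) / temperature
        (pos if y == 1 else neg).append(score)
    if not pos:
        return 0.0
    # For each positive: -log( exp(s) / (exp(s) + sum of exp(negatives)) )
    losses = [logsumexp([s] + neg) - s for s in pos]
    return sum(losses) / len(losses)

# Toy data: two foreground pixels, one ambiguous pixel, one background pixel.
pixels = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
labels = [1, 1, 0, 0]
ambiguous = [False, False, True, False]

aligned = pixel_text_contrastive_loss(pixels, [1.0, 0.0], labels, ambiguous)
misaligned = pixel_text_contrastive_loss(pixels, [-1.0, 0.0], labels, ambiguous)
print(aligned < misaligned)
```

A text embedding that points toward the foreground pixels yields a much smaller loss than one that points toward the background, which is the behavior a contrastive alignment objective is meant to reward.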

arXiv information

Authors	Zhichao Wei, Xiaohao Chen, Mingqiang Chen, Siyu Zhu
Published	2023-02-10 08:59:15+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
