Generalized Decoding for Pixel, Image, and Language

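Abstract

X-Decoder is a generalized decoding model that can seamlessly predict both pixel-level segmentation and language tokens.
X-Decoder takes two types of queries as input, (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, and decodes different pixel-level and token-level outputs in the same semantic space.
With this novel design, X-Decoder is the first work to provide a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks.
Furthermore, the design enables seamless interaction between tasks at different granularities and brings mutual benefits by learning a common, rich pixel-level visual-semantic understanding space, without any pseudo-labeling.
After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder shows strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings.
Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets;
(2) finetuned performance that is better than or competitive with other generalist and specialist models on segmentation and VL tasks; and
(3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning and image editing).
Code, demo, video, and visualizations are available at https://x-decoder-vl.github.io.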
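
The two-query idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`cross_attend`, `decode`), the single attention step, the random vectors standing in for image features and encoded text, and the dot-product scoring are all assumptions made for illustration. It only shows how generic queries and text-derived queries could pass through one decoder and each yield a pixel-level output and a token-level output in a shared embedding space.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, pixels):
    """Return the attention-weighted average of pixel features for one query."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, p) / scale for p in pixels])
    return [sum(w * p[i] for w, p in zip(weights, pixels))
            for i in range(len(query))]

def decode(latent_queries, text_queries, pixels, vocab):
    """Decode both query types with the same operations (hypothetical sketch).

    Returns (mask_logits, token_scores):
      mask_logits[q][p]  : pixel-level output, query q's score for pixel p
      token_scores[q][v] : token-level output, query q's score for vocab entry v
    """
    queries = latent_queries + text_queries  # both types share one decoder
    outputs = [cross_attend(q, pixels) for q in queries]
    mask_logits = [[dot(o, p) for p in pixels] for o in outputs]
    token_scores = [[dot(o, v) for v in vocab] for o in outputs]
    return mask_logits, token_scores

rng = random.Random(0)

def rand_vecs(n, dim):
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

DIM = 8
pixels = rand_vecs(4 * 4, DIM)      # a 4x4 feature map, flattened (stand-in)
latent_queries = rand_vecs(3, DIM)  # generic non-semantic queries
text_queries = rand_vecs(2, DIM)    # stand-in for queries from encoded text
vocab = rand_vecs(5, DIM)           # stand-in for concept/text embeddings

mask_logits, token_scores = decode(latent_queries, text_queries, pixels, vocab)
print(len(mask_logits), len(mask_logits[0]))    # 5 16
print(len(token_scores), len(token_scores[0]))  # 5 5
```

Every query, whatever its type, gets one row of mask logits over the 16 pixels and one row of scores over the 5 vocabulary entries. That is the sense in which the two output types share a single semantic space in this sketch.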

Abstract (original)

We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Further, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level visual-semantic understanding space, without any pseudo-labeling. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets; (2) better or competitive finetuned performance to other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning and image editing). Code, demo, video, and visualization are available at https://x-decoder-vl.github.io.

arXiv information

Authors	Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee, Jianfeng Gao
Published	2022-12-21 18:58:41+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
