Tokenize Anything via Prompting

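Summary

We present a unified, promptable model that can segment, recognize, and caption anything simultaneously.
Unlike SAM, we aim to build a versatile region representation in the wild via visual prompting.
To achieve this, we train a generalizable model using massive segmentation masks, such as the SA-1B masks, together with semantic priors from a pre-trained CLIP model with 5 billion parameters.
Specifically, we construct a promptable image decoder by adding a semantic token to each mask token.
The semantic token is responsible for learning the semantic priors in a predefined concept space.
Through joint optimization of segmentation on the mask tokens and concept prediction on the semantic tokens, our model exhibits strong region recognition and localization capabilities.
For example, an additional 38M-parameter causal text decoder trained from scratch sets a new record on the Visual Genome region captioning task, with a CIDEr score of 150.7.
We believe this model can serve as a versatile region-level image tokenizer, capable of encoding general-purpose region context for a broad range of perception tasks.
Code and models are available at https://github.com/baaivision/tokenize-anything.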
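
The joint optimization mentioned in the summary (a segmentation objective on the mask token plus a concept-prediction objective on the paired semantic token) can be illustrated with a small sketch. The summary does not say which loss functions are used, so the binary cross-entropy and KL-divergence terms below are assumptions, as are all function names and the `concept_weight` parameter. This is not the code from the linked repository.

```python
import math


def softmax(logits):
    """Turn concept logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over mask pixels (numerically stable form)."""
    total = 0.0
    for x, t in zip(logits, targets):
        total += max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) between two concept distributions."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(target, predicted)
        if p > 0
    )


def joint_loss(mask_logits, mask_target, concept_logits, concept_target,
               concept_weight=1.0):
    """Hypothetical joint objective for one (mask token, semantic token) pair.

    mask_logits / mask_target: flattened per-pixel logits and 0/1 labels.
    concept_logits: semantic-token scores over a predefined concept space.
    concept_target: target distribution over the same concepts, assumed here
    to come from a pre-trained CLIP model.
    """
    seg = bce_with_logits(mask_logits, mask_target)
    concept = kl_divergence(concept_target, softmax(concept_logits))
    return seg + concept_weight * concept


# Toy example: a 4-pixel mask and a 3-concept space.
target_dist = softmax([2.0, 0.5, 0.1])
loss = joint_loss(
    mask_logits=[2.0, -1.0, 0.5, -3.0],
    mask_target=[1.0, 0.0, 1.0, 0.0],
    concept_logits=[1.5, 0.2, 0.3],
    concept_target=target_dist,
)
print(f"joint loss: {loss:.4f}")
```

Summing the two terms means a single backward pass would update both tokens of a pair. In practice the relative weight between the terms would be a tuning choice.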

Summary (original)

We present a unified, promptable model capable of simultaneously segmenting, recognizing, and captioning anything. Unlike SAM, we aim to build a versatile region representation in the wild via visual prompting. To achieve this, we train a generalizable model with massive segmentation masks, e.g., SA-1B masks, and semantic priors from a pre-trained CLIP model with 5 billion parameters. Specifically, we construct a promptable image decoder by adding a semantic token to each mask token. The semantic token is responsible for learning the semantic priors in a predefined concept space. Through joint optimization of segmentation on mask tokens and concept prediction on semantic tokens, our model exhibits strong regional recognition and localization capabilities. For example, an additional 38M-parameter causal text decoder trained from scratch sets a new record with a CIDEr score of 150.7 on the Visual Genome region captioning task. We believe this model can be a versatile region-level image tokenizer, capable of encoding general-purpose region context for a broad range of perception tasks. Code and models are available at https://github.com/baaivision/tokenize-anything.

arXiv information

Authors	Ting Pan, Lulu Tang, Xinlong Wang, Shiguang Shan
Published	2023-12-14 17:01:02+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
