Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation

Summary

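Visual understanding is often approached at three levels of granularity: image, patch, and pixel.
Visual tokenization, trained with self-supervised reconstructive learning, compresses visual data into a patch-level codebook with minimal information loss, but the resulting visual tokens carry no semantic meaning.
Open-vocabulary semantic segmentation benefits from evolving vision-language models (VLMs) with strong image-level zero-shot capability, but transferring image-level understanding to the pixel level remains a pressing challenge.
In this paper, we treat segmentation as pixel tokenization and study a unified perceptual and semantic token compression for understanding at every granularity, which in turn facilitates open-vocabulary semantic segmentation.
Drawing on the cognitive process of pretrained VLMs, in which low-level features are progressively composed into high-level semantics, we propose Feature Pyramid Tokenization (PAT), which clusters and represents multi-resolution features with learnable codebooks and decodes them by jointly learning pixel reconstruction and semantic segmentation.
We design loosely coupled pixel and semantic learning branches.
The pixel branch simulates bottom-up composition and top-down visualization of codebook tokens, while the semantic branch collectively fuses the hierarchical codebooks as auxiliary segmentation guidance.
Our experiments show that PAT enhances the semantic intuition of the VLM feature pyramid, improves on the baseline segmentation model, and achieves competitive performance on open-vocabulary semantic segmentation benchmarks.
Our model is parameter-efficient for VLM integration and flexible enough for standalone tokenization.
We hope to inspire not only improvements in segmentation but also the use of semantic visual tokens.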
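
The idea of representing multi-resolution features with codebooks can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`quantize`, `tokenize_pyramid`, `dequantize`), the toy two-level pyramid, and the fixed codebook values are all assumptions made for illustration. In the paper the codebooks are learnable and trained jointly with the decoders, whereas here they are hard-coded and each feature is simply assigned to its nearest entry.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features, codebook):
    """Return, for each feature vector, the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def tokenize_pyramid(pyramid, codebooks):
    """Quantize each pyramid level with its own codebook.

    pyramid:   list of levels, each a list of feature vectors
    codebooks: list of codebooks, one per level (hypothetical fixed values here)
    """
    return [quantize(level, cb) for level, cb in zip(pyramid, codebooks)]


def dequantize(tokens, codebook):
    """Replace token indices by their codebook vectors (a lossy reconstruction)."""
    return [codebook[t] for t in tokens]


# Toy two-level pyramid: four fine-level features and one coarse-level feature.
pyramid = [
    [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [-0.2, 0.1]],
    [[2.0, 2.0]],
]
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # fine-level codebook
    [[0.0, 0.0], [1.0, 1.0]],  # coarse-level codebook
]
tokens = tokenize_pyramid(pyramid, codebooks)
print(tokens)
```

Each level yields a list of integer tokens, so the same structure could in principle feed either a reconstruction decoder or a segmentation head, which is the role the abstract assigns to the pixel and semantic branches.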

Summary (original)

Visual understanding is often approached at three levels of granularity: image, patch, and pixel. Visual tokenization, trained by self-supervised reconstructive learning, compresses visual data with a patch-level codebook at marginal information loss, but the visual tokens do not have semantic meaning. Open-vocabulary semantic segmentation benefits from evolving vision-language models (VLMs) with strong image zero-shot capability, but transferring image-level understanding to the pixel level remains an imminent challenge. In this paper, we treat segmentation as tokenizing pixels and study a unified perceptual and semantic token compression for understanding at all granularities, which consequently facilitates open-vocabulary semantic segmentation. Referring to the cognitive process of pretrained VLMs, where low-level features are progressively composed into high-level semantics, we propose Feature Pyramid Tokenization (PAT) to cluster and represent multi-resolution features with learnable codebooks and then decode them by jointly learning pixel reconstruction and semantic segmentation. We design loosely coupled pixel and semantic learning branches. The pixel branch simulates bottom-up composition and top-down visualization of codebook tokens, while the semantic branch collectively fuses hierarchical codebooks as auxiliary segmentation guidance. Our experiments show that PAT enhances the semantic intuition of the VLM feature pyramid, improves performance over the baseline segmentation model, and achieves competitive performance on open-vocabulary semantic segmentation benchmarks. Our model is parameter-efficient for VLM integration and flexible for independent tokenization. We hope to give inspiration not only on improving segmentation but also on semantic visual token utilization.

arXiv information

Authors	Jianyu Zhang, Li Zhang, Shijian Li
Published	2024-12-18 18:43:21+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
