Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning

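Summary

Vision Transformers have established a precedent of patchifying images into uniformly sized chunks before processing.
We hypothesize that this design choice may limit models in learning comprehensive, compositional representations from visual data.
This paper explores the idea of providing semantically meaningful visual tokens to transformer encoders within a vision-language pre-training framework.
Using off-the-shelf segmentation and scene-graph models, we extract representations of instance segmentation masks (called tangible tokens) and of relationships and actions (called intangible tokens).
We then pre-train a vision-side transformer by incorporating these newly extracted tokens and aligning the resulting embeddings with caption embeddings from a text-side encoder.
To capture the structural and semantic relationships among visual tokens, we introduce additive attention weights, which are used to compute self-attention scores.
Experiments on COCO show notable improvements over ViTs in learned representation quality on text-to-image (+47%) and image-to-text (+44%) retrieval tasks.
We also show advantages on compositionality benchmarks such as ARO (+18%) and Winoground (+10%).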

Abstract (original)

Vision transformers have established a precedent of patchifying images into uniformly-sized chunks before processing. We hypothesize that this design choice may limit models in learning comprehensive and compositional representations from visual data. This paper explores the notion of providing semantically-meaningful visual tokens to transformer encoders within a vision-language pre-training framework. Leveraging off-the-shelf segmentation and scene-graph models, we extract representations of instance segmentation masks (referred to as tangible tokens) and relationships and actions (referred to as intangible tokens). Subsequently, we pre-train a vision-side transformer by incorporating these newly extracted tokens and aligning the resultant embeddings with caption embeddings from a text-side encoder. To capture the structural and semantic relationships among visual tokens, we introduce additive attention weights, which are used to compute self-attention scores. Our experiments on COCO demonstrate notable improvements over ViTs in learned representation quality across text-to-image (+47%) and image-to-text retrieval (+44%) tasks. Furthermore, we showcase the advantages on compositionality benchmarks such as ARO (+18%) and Winoground (+10%).
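
The abstract does not say how the additive attention weights are constructed or exactly where they enter the computation. As a minimal sketch, assuming they are added to the scaled dot-product logits before the softmax, single-head self-attention with an additive term could look like the following. The function names, the toy tokens, and the `bias` values are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_with_additive_weights(q, k, v, bias):
    """Single-head self-attention with an additive term on the logits.

    q, k, v: lists of n token vectors (queries, keys, values).
    bias:    n x n matrix (hypothetical values) added to the scaled
             dot-product logits before the softmax. This placement is
             an assumption, not something the abstract specifies.
    Returns (outputs, attention_weights).
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for i, qi in enumerate(q):
        logits = [
            scale * sum(a * b for a, b in zip(qi, kj)) + bias[i][j]
            for j, kj in enumerate(k)
        ]
        weights.append(softmax(logits))
    # Each output is the attention-weighted average of the value vectors.
    outputs = [
        [sum(w * vj[c] for w, vj in zip(row, v)) for c in range(len(v[0]))]
        for row in weights
    ]
    return outputs, weights


# Toy input: three tokens with identical queries and keys, so the
# dot-product logits are equal and any preference comes from the bias.
tokens = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
bias = [
    [0.0, 2.0, 0.0],  # token 0 is nudged toward token 1
    [2.0, 0.0, 0.0],  # token 1 is nudged toward token 0
    [0.0, 0.0, 0.0],  # token 2 has no additive preference
]
outputs, weights = attention_with_additive_weights(tokens, tokens, values, bias)
```

In this toy setup, token 0 attends more strongly to token 1 than to itself purely because of the additive term, while token 2, whose bias row is all zeros, attends uniformly. In the setting the abstract describes, such a term would presumably encode relationships between tangible and intangible tokens, but how it is derived is not stated here.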

arXiv information

Authors	Neha Kalibhat, Priyatham Kattakinda, Sumit Nawathe, Arman Zarei, Nikita Seleznev, Samuel Sharpe, Senthil Kumar, Soheil Feizi
Published	2025-05-19 16:00:51+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
