TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification

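Abstract

The crux of learning vision-language models is to extract semantically aligned information from visual and linguistic data.
Existing attempts usually suffer from coarse alignment; for example, the vision encoder struggles to localize an attribute-specified object.
In this work, we propose an embarrassingly simple approach that better aligns image and text features without requiring any data format beyond image-text pairs.
Concretely, given an image and its paired text, we parse from the description the objects (e.g., cat) and attributes (e.g., black) that are highly likely to exist in the image.
Notably, the parsing pipeline is fully automatic and therefore scales well.
Using these parsed semantics as supervision signals, we complement the commonly used image-text contrastive loss with a multi-tag classification loss.
Extensive experiments on a broad suite of semantic segmentation datasets show an average improvement of 3.65% for our framework over existing alternatives.
Furthermore, the visualization results indicate that attribute supervision enables vision-language models to accurately localize attribute-specified objects.
The project page is at https://qinying-liu.github.io/Tag-Align/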

Abstract (original)

The crux of learning vision-language models is to extract semantically aligned information from visual and linguistic data. Existing attempts usually face the problem of coarse alignment, e.g., the vision encoder struggles in localizing an attribute-specified object. In this work, we propose an embarrassingly simple approach to better align image and text features with no need of additional data formats other than image-text pairs. Concretely, given an image and its paired text, we manage to parse objects (e.g., cat) and attributes (e.g., black) from the description, which are highly likely to exist in the image. It is noteworthy that the parsing pipeline is fully automatic and thus enjoys good scalability. With these parsed semantics as supervision signals, we can complement the commonly used image-text contrastive loss with the multi-tag classification loss. Extensive experimental results on a broad suite of semantic segmentation datasets substantiate the average 3.65% improvement of our framework over existing alternatives. Furthermore, the visualization results indicate that attribute supervision makes vision-language models accurately localize attribute-specified objects. Project page can be found at https://qinying-liu.github.io/Tag-Align/
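
The idea of complementing the contrastive loss with a multi-tag classification loss can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the abstract does not describe how tags are parsed, so a plain word lookup against a hypothetical vocabulary (`OBJECT_TAGS`, `ATTRIBUTE_TAGS`) stands in for the automatic parsing pipeline, and the temperature of 0.07, the `tag_weight` parameter, and the choice of a per-tag sigmoid with binary cross-entropy are likewise assumed.

```python
import math

# Hypothetical tag vocabulary (an assumption). A plain word lookup stands in
# for the paper's automatic parsing pipeline, which the abstract does not detail.
OBJECT_TAGS = ["cat", "dog", "sofa"]
ATTRIBUTE_TAGS = ["black", "white", "small"]
TAGS = OBJECT_TAGS + ATTRIBUTE_TAGS


def parse_tags(caption):
    """Return a multi-hot target: 1.0 for each vocabulary tag found in the caption."""
    words = set(caption.lower().replace(".", " ").replace(",", " ").split())
    return [1.0 if tag in words else 0.0 for tag in TAGS]


def multi_tag_loss(tag_logits, targets):
    """Mean binary cross-entropy over tags, one independent sigmoid per tag."""
    total = 0.0
    for z, y in zip(tag_logits, targets):
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(targets)


def contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over a square image-text similarity matrix.

    sim[i][j] is the similarity of image i and text j; matched pairs lie on
    the diagonal. The temperature value is an assumed default.
    """
    n = len(sim)

    def cross_entropy(rows):
        loss = 0.0
        for i, row in enumerate(rows):
            scaled = [s / temperature for s in row]
            m = max(scaled)
            log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
            loss += log_z - scaled[i]
        return loss / n

    columns = [list(col) for col in zip(*sim)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(sim) + cross_entropy(columns))


def total_loss(sim, tag_logits_batch, captions, tag_weight=1.0):
    """Contrastive loss plus a weighted multi-tag classification loss."""
    tag_term = sum(
        multi_tag_loss(logits, parse_tags(caption))
        for logits, caption in zip(tag_logits_batch, captions)
    ) / len(captions)
    return contrastive_loss(sim) + tag_weight * tag_term
```

In this sketch, `parse_tags("A black cat on the sofa.")` marks `cat`, `sofa`, and `black` as present, and `total_loss` adds the tag term to the contrastive term so that the image branch is also pushed to predict which objects and attributes the caption mentions.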

arXiv information

Authors	Qinying Liu, Kecheng Zheng, Wu Wei, Zhan Tong, Yu Liu, Wei Chen, Zilei Wang, Yujun Shen
Published	2023-12-21 18:59:06+00:00

Source, services used

arxiv.jp, Google
