It’s a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data

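Summary

The platonic representation hypothesis suggests that vision and language embeddings become more homogeneous as model and dataset sizes increase.
In particular, pairwise distances within each modality become more similar.
This suggests that, as foundation models mature, it may become possible to match vision and language embeddings in a fully unsupervised fashion, that is, without parallel data.
We present the first feasibility study and investigate how well existing vision and language foundation models conform in the context of unsupervised, or "blind", matching.
First, we formulate unsupervised matching as a quadratic assignment problem and introduce a novel heuristic that outperforms previous solvers.
We also develop a technique to find optimal matching problems, for which a non-trivial match is very likely.
Second, we conduct an extensive study deploying a range of vision and language models on four datasets.
Our analysis reveals that, for many problem instances, vision and language representations can indeed be matched without supervision.
This finding opens up the exciting possibility of embedding semantic knowledge into other modalities virtually annotation-free.
As a proof of concept, we showcase an unsupervised classifier that achieves non-trivial classification accuracy without any image-text annotation.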

Summary (original)

The platonic representation hypothesis suggests that vision and language embeddings become more homogeneous as model and dataset sizes increase. In particular, pairwise distances within each modality become more similar. This suggests that as foundation models mature, it may become possible to match vision and language embeddings in a fully unsupervised fashion, i.e. without parallel data. We present the first feasibility study, and investigate conformity of existing vision and language foundation models in the context of unsupervised, or ‘blind’, matching. First, we formulate unsupervised matching as a quadratic assignment problem and introduce a novel heuristic that outperforms previous solvers. We also develop a technique to find optimal matching problems, for which a non-trivial match is very likely. Second, we conduct an extensive study deploying a range of vision and language models on four datasets. Our analysis reveals that for many problem instances, vision and language representations can be indeed matched without supervision. This finding opens up the exciting possibility of embedding semantic knowledge into other modalities virtually annotation-free. As a proof of concept, we showcase an unsupervised classifier, which achieves non-trivial classification accuracy without any image-text annotation.
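The quadratic assignment formulation mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' heuristic: it replaces it with an exhaustive search over permutations, which is only feasible for a handful of items. The function names (`pairwise_distances`, `qap_cost`, `blind_match`) and the synthetic 2-D "vision" and "language" points are hypothetical, chosen so that the two sets share identical pairwise distances.

```python
import itertools
import math
import random


def pairwise_distances(points):
    """Euclidean distance matrix for a list of equal-length vectors."""
    return [[math.dist(p, q) for q in points] for p in points]


def qap_cost(dist_a, dist_b, perm):
    """Squared mismatch between dist_a[i][j] and dist_b[perm[i]][perm[j]]."""
    n = len(perm)
    return sum(
        (dist_a[i][j] - dist_b[perm[i]][perm[j]]) ** 2
        for i in range(n)
        for j in range(n)
    )


def blind_match(dist_a, dist_b):
    """Exhaustive search over all permutations (toy sizes only)."""
    n = len(dist_a)
    return min(
        itertools.permutations(range(n)),
        key=lambda perm: qap_cost(dist_a, dist_b, perm),
    )


# Assumption: the "language" points are a rotated, shuffled copy of the
# "vision" points, so pairwise distances agree exactly across modalities.
rng = random.Random(0)
n_items = 6
vision = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_items)]
theta = 0.7
rotated = [
    (math.cos(theta) * x - math.sin(theta) * y,
     math.sin(theta) * x + math.cos(theta) * y)
    for x, y in vision
]
true_perm = list(range(n_items))
rng.shuffle(true_perm)
language = [None] * n_items
for i, j in enumerate(true_perm):
    language[j] = rotated[i]  # language[true_perm[i]] corresponds to vision[i]

# Matching uses only the within-modality distances, never the coordinates
# of one modality expressed in the other.
recovered = blind_match(pairwise_distances(vision), pairwise_distances(language))
print("recovered the hidden correspondence:", list(recovered) == true_perm)
```

The search sees no paired examples; it relies solely on the assumption that the distance structure is shared. Real embeddings satisfy this only approximately, and the number of permutations grows factorially, which is why the abstract calls for a dedicated heuristic.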

arXiv information

Authors	Dominik Schnaus,Nikita Araslanov,Daniel Cremers
Published	2025-03-31 14:14:25+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
