Emergent Visual-Semantic Hierarchies in Image-Text Representations

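Summary

Recent vision-and-language models (VLMs) such as CLIP are powerful tools for analyzing text and images in a shared semantic space, but they do not explicitly model the hierarchical nature of the set of texts that may describe an image.
Conversely, existing multimodal hierarchical representation learning methods require costly training from scratch and fail to leverage the knowledge encoded by state-of-the-art multimodal foundation models.
In this work, we study the knowledge of existing foundation models and find that they exhibit an emergent understanding of visual-semantic hierarchies despite not being directly trained for this purpose.
We propose the Radial Embedding (RE) framework for probing and optimizing hierarchical understanding, and we contribute the HierarCaps dataset, a benchmark constructed automatically via large language models that facilitates the study of hierarchical knowledge in image-text representations.
Our results show that foundation VLMs exhibit zero-shot hierarchical understanding, surpassing the performance of prior models explicitly designed for this purpose.
Furthermore, we show that foundation models may be better aligned to hierarchical reasoning via a text-only fine-tuning phase while retaining their pretraining knowledge.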

Abstract (original)

While recent vision-and-language models (VLMs) like CLIP are a powerful tool for analyzing text and images in a shared semantic space, they do not explicitly model the hierarchical nature of the set of texts which may describe an image. Conversely, existing multimodal hierarchical representation learning methods require costly training from scratch, failing to leverage the knowledge encoded by state-of-the-art multimodal foundation models. In this work, we study the knowledge of existing foundation models, finding that they exhibit emergent understanding of visual-semantic hierarchies despite not being directly trained for this purpose. We propose the Radial Embedding (RE) framework for probing and optimizing hierarchical understanding, and contribute the HierarCaps dataset, a benchmark facilitating the study of hierarchical knowledge in image–text representations, constructed automatically via large language models. Our results show that foundation VLMs exhibit zero-shot hierarchical understanding, surpassing the performance of prior models explicitly designed for this purpose. Furthermore, we show that foundation models may be better aligned to hierarchical reasoning via a text-only fine-tuning phase, while retaining pretraining knowledge.

arXiv information

Authors	Morris Alper, Hadar Averbuch-Elor
Published	2024-07-11 14:09:42+00:00
