3VL: Using Trees to Improve Vision-Language Models’ Interpretability

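Summary

Vision-language models (VLMs) have proven effective at aligning image and text representations, and they produce strong zero-shot results when transferred to many downstream tasks.
However, these representations have several key shortcomings in understanding Compositional Language Concepts (CLC), such as recognizing objects' attributes, states, and the relations between different objects.
In addition, VLMs typically have poor interpretability, which makes it hard to debug and mitigate compositional-understanding failures.
This work introduces the architecture and training technique of the Tree-augmented Vision-Language (3VL) model, together with the proposed Anchor inference method and the Differential Relevance (DiRe) interpretability tool.
3VL uses language analysis tools to expand the text of an arbitrary image-text pair into a hierarchical tree structure, which allows this structure to be induced into the visual representation the model learns and improves its interpretability and compositional reasoning.
The authors also show how Anchor, a simple technique for text unification, can filter out nuisance factors while improving CLC understanding performance, for example on the fundamental VL-Checklist benchmark.
They further show how DiRe, which performs a differential comparison between VLM relevancy maps, can generate compelling visualizations of the reasons for a model's success or failure.
The code is available at https://github.com/niryellinek/3VL.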

Abstract (original)

Vision-Language models (VLMs) have proven to be effective at aligning image and text representations, producing superior zero-shot results when transferred to many downstream tasks. However, these representations suffer from some key shortcomings in understanding Compositional Language Concepts (CLC), such as recognizing objects’ attributes, states, and relations between different objects. Moreover, VLMs typically have poor interpretability, making it challenging to debug and mitigate compositional-understanding failures. In this work, we introduce the architecture and training technique of Tree-augmented Vision-Language (3VL) model accompanied by our proposed Anchor inference method and Differential Relevance (DiRe) interpretability tool. By expanding the text of an arbitrary image-text pair into a hierarchical tree structure using language analysis tools, 3VL allows the induction of this structure into the visual representation learned by the model, enhancing its interpretability and compositional reasoning. Additionally, we show how Anchor, a simple technique for text unification, can be used to filter nuisance factors while increasing CLC understanding performance, e.g., on the fundamental VL-Checklist benchmark. We also show how DiRe, which performs a differential comparison between VLM relevancy maps, enables us to generate compelling visualizations of the reasons for a model’s success or failure. Our code is available at: https://github.com/niryellinek/3VL.
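
The abstract says only that DiRe "performs a differential comparison between VLM relevancy maps". As a rough illustration of that idea, the sketch below subtracts two relevancy maps of the same image, one per caption. The min-max normalization, the plain element-wise subtraction, the function names `normalize` and `differential_map`, and the toy numbers are all assumptions made for this example; they are not taken from the paper or the 3VL repository.

```python
from typing import List

Map = List[List[float]]


def normalize(relevance: Map) -> Map:
    """Scale a relevancy map to [0, 1] with min-max normalization (an assumption)."""
    flat = [v for row in relevance for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant map carries no spatial preference; return all zeros.
        return [[0.0 for _ in row] for row in relevance]
    return [[(v - lo) / span for v in row] for row in relevance]


def differential_map(map_a: Map, map_b: Map) -> Map:
    """Return normalize(map_a) - normalize(map_b), cell by cell.

    Positive cells are weighted more for caption A, negative cells more
    for caption B. This is a hypothetical stand-in, not the DiRe tool itself.
    """
    same_rows = len(map_a) == len(map_b)
    if not same_rows or any(len(ra) != len(rb) for ra, rb in zip(map_a, map_b)):
        raise ValueError("relevancy maps must have the same shape")
    a, b = normalize(map_a), normalize(map_b)
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy 2x2 relevancy maps for one image under two captions (made-up numbers).
positive_caption_map = [[0.0, 2.0], [4.0, 2.0]]
negative_caption_map = [[1.0, 1.0], [3.0, 1.0]]
diff = differential_map(positive_caption_map, negative_caption_map)
```

In this toy case the right-hand column comes out positive, meaning those cells matter more for the first caption than for the second. A real visualization would overlay such a difference on the image.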

arXiv information

Authors	Nir Yellinek, Leonid Karlinsky, Raja Giryes
Published	2025-01-15 12:46:07+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
