From Local Details to Global Context: Advancing Vision-Language Models with Attention-Based Selection

Summary

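Pretrained vision-language models (VLMs) such as CLIP show impressive zero-shot capabilities on downstream tasks.
Prior work highlights the important role of visual augmentation techniques such as random cropping: aligning the augmented views with fine-grained class descriptions generated by large language models (LLMs) brings in multi-view information and significantly improves zero-shot performance.
However, the inherent randomness of these augmentations can introduce background artifacts and cause the model to focus too heavily on local details, compromising global semantic understanding.
To address these issues, the authors propose an Attention-Based Selection (ABS) method that applies attention-guided cropping in both raw images and feature space.
They also introduce a soft matching technique that filters LLM descriptions for better alignment.
ABS achieves state-of-the-art performance on out-of-distribution generalization and zero-shot classification tasks.
Notably, ABS is training-free and even rivals few-shot and test-time adaptation methods.
The code is available at https://github.com/BIT-DA/ABS.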

Abstract (original)

Pretrained vision-language models (VLMs), e.g., CLIP, demonstrate impressive zero-shot capabilities on downstream tasks. Prior research highlights the crucial role of visual augmentation techniques, like random cropping, in alignment with fine-grained class descriptions generated by large language models (LLMs), significantly enhancing zero-shot performance by incorporating multi-view information. However, the inherent randomness of these augmentations can introduce background artifacts and cause models to overly focus on local details, compromising global semantic understanding. To address these issues, we propose an Attention-Based Selection (ABS) method from local details to global context, which applies attention-guided cropping in both raw images and feature space and supplements global semantic information through strategic feature selection. Additionally, we introduce a soft matching technique to effectively filter LLM descriptions for better alignment. ABS achieves state-of-the-art performance on out-of-distribution generalization and zero-shot classification tasks. Notably, ABS is training-free and even rivals few-shot and test-time adaptation methods. Our code is available at https://github.com/BIT-DA/ABS.
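
The abstract gives no implementation details, so the following is only a minimal sketch of the two ideas it names, not the authors' code. It assumes a 2-D attention map is already available (for instance from a ViT's class-token attention), and the function names `attention_guided_crops` and `soft_match_score`, the fixed square crop size, and the softmax temperature are all hypothetical choices made for illustration.

```python
import math


def attention_guided_crops(attn, crop_size, k):
    """Return k (top, left, bottom, right) boxes centred on the k
    highest-attention cells of a 2-D map, clamped to stay inside it.

    Assumption: replaces random crop positions with attention peaks.
    """
    height, width = len(attn), len(attn[0])
    if crop_size > min(height, width):
        raise ValueError("crop_size must fit inside the attention map")
    # Rank every cell by attention weight, highest first.
    cells = sorted(
        ((attn[r][c], r, c) for r in range(height) for c in range(width)),
        key=lambda cell: -cell[0],
    )
    half = crop_size // 2
    boxes = []
    for _, r, c in cells[:k]:
        # Shift the box back inside the map if the peak is near a border.
        top = min(max(r - half, 0), height - crop_size)
        left = min(max(c - half, 0), width - crop_size)
        boxes.append((top, left, top + crop_size, left + crop_size))
    return boxes


def soft_match_score(similarities, temperature=0.1):
    """Softmax-weighted average of crop-to-description similarities.

    Assumption: descriptions that match the crop poorly get a small
    weight instead of being averaged in uniformly.
    """
    peak = max(similarities)  # subtract the max for numerical stability
    weights = [math.exp((s - peak) / temperature) for s in similarities]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, similarities)) / total


if __name__ == "__main__":
    attn_map = [
        [0.0, 0.1, 0.0, 0.0],
        [0.1, 0.9, 0.2, 0.0],
        [0.0, 0.2, 0.1, 0.0],
        [0.0, 0.0, 0.0, 0.5],
    ]
    print(attention_guided_crops(attn_map, crop_size=2, k=2))
    print(soft_match_score([0.2, 0.8]))
```

In this sketch each crop is anchored on an attention peak rather than a random position, and the soft score sits between the plain mean and the maximum similarity, so weakly matching descriptions are down-weighted rather than discarded outright.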

arXiv information

Authors	Lincan Cai, Jingxuan Kang, Shuang Li, Wenxuan Ma, Binhui Xie, Zhida Qin, Jian Liang
Published	2025-05-19 15:15:37+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
