FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations

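Summary

Visual understanding is inherently contextual: what we focus on in an image depends on the task at hand.
For example, given an image of a person holding a bouquet of flowers, we may focus either on the person (for instance, their clothing) or on the type of flowers, depending on the context of interest.
Yet most existing image encoding paradigms represent an image as a fixed, generic feature vector, overlooking the potential need to prioritize different visual information for different downstream use cases.
In this work, we introduce FocalLens, a conditional visual encoding method that produces different representations of the same image based on the context of interest, expressed flexibly through natural language.
We leverage vision instruction tuning data and contrastively fine-tune a pretrained vision encoder to take natural language instructions as additional input for producing conditional image representations.
Extensive experiments validate that conditional image representations from FocalLens emphasize the visual features of interest more strongly than the generic features produced by standard vision encoders such as CLIP.
In addition, FocalLens improves performance on a range of downstream tasks, including image-image retrieval, image classification, and image-text retrieval, with average gains of 5 and 10 points on the challenging SugarCrepe and MMVP-VLM benchmarks, respectively.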

Abstract (original)

Visual understanding is inherently contextual — what we focus on in an image depends on the task at hand. For instance, given an image of a person holding a bouquet of flowers, we may focus on either the person such as their clothing, or the type of flowers, depending on the context of interest. Yet, most existing image encoding paradigms represent an image as a fixed, generic feature vector, overlooking the potential needs of prioritizing varying visual information for different downstream use cases. In this work, we introduce FocalLens, a conditional visual encoding method that produces different representations for the same image based on the context of interest, expressed flexibly through natural language. We leverage vision instruction tuning data and contrastively finetune a pretrained vision encoder to take natural language instructions as additional inputs for producing conditional image representations. Extensive experiments validate that conditional image representation from FocalLens better pronounce the visual features of interest compared to generic features produced by standard vision encoders like CLIP. In addition, we show FocalLens further leads to performance improvements on a range of downstream tasks including image-image retrieval, image classification, and image-text retrieval, with an average gain of 5 and 10 points on the challenging SugarCrepe and MMVP-VLM benchmarks, respectively.
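
The abstract says only that the vision encoder is fine-tuned "contrastively"; it does not give the objective. As a rough illustration of what such an objective can look like, the sketch below implements a generic CLIP-style symmetric InfoNCE loss in plain Python. Everything here is an assumption rather than the authors' method: the function names, the temperature value, and the pairing of each instruction-conditioned image embedding with the embedding of a matching target text.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(cond_image_embs, target_text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Hypothetical setup: cond_image_embs[k] is the embedding of image k
    conditioned on an instruction, and target_text_embs[k] is the embedding
    of the text that should match it. All other texts in the batch act as
    negatives. The temperature of 0.07 is an illustrative default.
    """
    images = [normalize(v) for v in cond_image_embs]
    texts = [normalize(v) for v in target_text_embs]

    # logits[i][j] = cosine similarity of image i and text j, scaled.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    def cross_entropy_on_diagonal(rows):
        # Mean of -log softmax(row)[k] for row k, using the log-sum-exp
        # trick for numerical stability.
        total = 0.0
        for k, row in enumerate(rows):
            peak = max(row)
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[k]
        return total / len(rows)

    # Transpose to get the text-to-image direction.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits_t))


if __name__ == "__main__":
    aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                               [[1.0, 0.0], [0.0, 1.0]])
    shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < shuffled)
```

The loss is small when each conditioned image embedding is closest to its own target text and large when the pairs are mismatched, which is the behavior a contrastive fine-tuning stage would optimize for.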

arXiv information

Authors	Cheng-Yu Hsieh, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Chun-Liang Li, Ranjay Krishna, Oncel Tuzel, Hadi Pouransari
Published	2025-04-11 09:07:05+00:00

Source and services used

arxiv.jp, Google
