GiVE: Guiding Visual Encoder to Perceive Overlooked Information

Summary

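Multimodal large language models have advanced AI in applications such as text-to-video generation and visual question answering.
These models rely on visual encoders to convert non-text data into vectors, but current encoders either lack semantic alignment or overlook non-salient objects.
We propose the Guiding Visual Encoder to Perceive Overlooked Information (GiVE) approach.
GiVE enhances visual representations with an Attention-Guided Adapter (AG-Adapter) module and an Object-focused Visual Semantic Learning module.
These incorporate three novel loss terms, the Object-focused Image-Text Contrast (OITC) loss, the Object-focused Image-Image Contrast (OIIC) loss, and the Object-focused Image Discrimination (OID) loss, which improve object consideration, retrieval accuracy, and comprehensiveness.
Our contributions include dynamic visual focus adjustment, novel loss functions to enhance object retrieval, and the Multi-Object Instruction (MOInst) dataset.
Experiments show that our approach achieves state-of-the-art performance.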

Abstract (original)

Multimodal Large Language Models have advanced AI in applications like text-to-video generation and visual question answering. These models rely on visual encoders to convert non-text data into vectors, but current encoders either lack semantic alignment or overlook non-salient objects. We propose the Guiding Visual Encoder to Perceive Overlooked Information (GiVE) approach. GiVE enhances visual representation with an Attention-Guided Adapter (AG-Adapter) module and an Object-focused Visual Semantic Learning module. These incorporate three novel loss terms: Object-focused Image-Text Contrast (OITC) loss, Object-focused Image-Image Contrast (OIIC) loss, and Object-focused Image Discrimination (OID) loss, improving object consideration, retrieval accuracy, and comprehensiveness. Our contributions include dynamic visual focus adjustment, novel loss functions to enhance object retrieval, and the Multi-Object Instruction (MOInst) dataset. Experiments show our approach achieves state-of-the-art performance.
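
The abstract names its loss terms without defining them. As background only, here is a minimal sketch of a generic symmetric image-text contrastive (InfoNCE-style) loss, the kind of objective commonly used to align image and text embeddings. It is not the paper's OITC, OIIC, or OID formulation; the function names, the temperature value, and the toy vectors are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_total for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Generic symmetric contrastive loss (illustrative, not the paper's).

    Pair k of image_embs and text_embs is treated as the positive pair;
    every other combination in the batch is treated as a negative.
    The temperature default is an assumed, commonly used value.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # Cosine similarities scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image direction: the same check on the transposed matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return 0.5 * (i2t + t2i)


# Toy 2-D embeddings (hypothetical values).
images = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
mismatched = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(matched < mismatched)
```

Correctly paired embeddings give a loss near zero, while swapped pairs are penalized heavily. An object-focused variant would presumably change what counts as a positive pair, but the abstract does not say how.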

arXiv information

Authors	Junjie Li, Jianghong Ma, Xiaofeng Zhang, Yuhang Li, Jianyang Shi
Published	2025-03-21 14:36:09+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
