Instruction-Guided Visual Masking

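Summary

Instruction following is crucial for modern LLMs.
When extended to multimodal settings, however, a specific textual instruction is often misaligned with the local image region it targets.
To enable more accurate and nuanced multimodal instruction following, we introduce Instruction-guided Visual Masking (IVM), a new, versatile visual grounding model that is compatible with diverse multimodal models such as LMMs and robot models.
By constructing visual masks for instruction-irrelevant regions, IVM lets the enhanced multimodal models focus on task-relevant image regions and align better with complex instructions.
Specifically, we design a visual masking data generation pipeline and create the IVM-Mix-1M dataset, which contains 1 million image-instruction pairs.
We further introduce Discriminator Weighted Supervised Learning (DWSL), a new learning technique for preferential IVM training that prioritizes high-quality data samples.
Experimental results on generic multimodal tasks such as VQA and embodied robotic control demonstrate the versatility of IVM. As a plug-and-play tool, IVM significantly boosts the performance of diverse multimodal models and yields new state-of-the-art results across challenging multimodal benchmarks.
Code, models, and data are available at https://github.com/2toinf/IVM.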
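
The masking idea in the summary can be illustrated with a small sketch. It is not the authors' implementation: it assumes a grayscale image stored as nested lists and a per-pixel relevance map in [0, 1] that some grounding model has already produced. The function name `apply_soft_mask` and the `fill` parameter are hypothetical and chosen only for illustration.

```python
from typing import List

# Assumption: a grayscale image as rows of pixel intensities.
Image = List[List[float]]


def apply_soft_mask(image: Image, relevance: Image, fill: float = 0.0) -> Image:
    """Blend each pixel toward `fill` according to its relevance in [0, 1].

    relevance 1.0 keeps the pixel unchanged; relevance 0.0 replaces it
    with `fill`, hiding an instruction-irrelevant region.
    """
    if len(image) != len(relevance):
        raise ValueError("image and relevance map must have the same height")
    masked: Image = []
    for img_row, rel_row in zip(image, relevance):
        if len(img_row) != len(rel_row):
            raise ValueError("image and relevance map must have the same width")
        out_row = []
        for pixel, r in zip(img_row, rel_row):
            r = min(1.0, max(0.0, r))  # clamp relevance into [0, 1]
            out_row.append(r * pixel + (1.0 - r) * fill)
        masked.append(out_row)
    return masked


# Hypothetical 2x2 example: the top-right pixel is irrelevant to the instruction.
image = [[10.0, 20.0], [30.0, 40.0]]
relevance = [[1.0, 0.0], [0.5, 1.0]]
masked = apply_soft_mask(image, relevance)
```

The masked image would then be passed to the downstream multimodal model in place of the original, which is one way to read the "plug-and-play" claim: the downstream model itself is left unchanged.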

Abstract (original)

Instruction following is crucial in contemporary LLM. However, when extended to multimodal setting, it often suffers from misalignment between specific textual instruction and targeted local region of an image. To achieve more accurate and nuanced multimodal instruction following, we introduce Instruction-guided Visual Masking (IVM), a new versatile visual grounding model that is compatible with diverse multimodal models, such as LMM and robot model. By constructing visual masks for instruction-irrelevant regions, IVM-enhanced multimodal models can effectively focus on task-relevant image regions to better align with complex instructions. Specifically, we design a visual masking data generation pipeline and create an IVM-Mix-1M dataset with 1 million image-instruction pairs. We further introduce a new learning technique, Discriminator Weighted Supervised Learning (DWSL) for preferential IVM training that prioritizes high-quality data samples. Experimental results on generic multimodal tasks such as VQA and embodied robotic control demonstrate the versatility of IVM, which as a plug-and-play tool, significantly boosts the performance of diverse multimodal models, yielding new state-of-the-art results across challenging multimodal benchmarks. Code, model and data are available at https://github.com/2toinf/IVM.
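
The abstract says only that DWSL prioritizes high-quality data samples and does not state the loss. The sketch below shows one generic way to weight a supervised loss by a per-sample quality score. It assumes the scores come from some discriminator and are non-negative. The function name `weighted_supervised_loss` and the normalization by the score sum are assumptions, not the paper's formulation.

```python
from typing import Sequence


def weighted_supervised_loss(losses: Sequence[float], scores: Sequence[float]) -> float:
    """Average per-sample losses, weighting each by a quality score.

    Samples with higher scores contribute more to the training signal;
    a score of 0.0 removes a sample from the average entirely.
    """
    if len(losses) != len(scores):
        raise ValueError("losses and scores must have the same length")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    total = sum(scores)
    if total == 0:
        raise ValueError("at least one score must be positive")
    return sum(loss * s for loss, s in zip(losses, scores)) / total


# Hypothetical batch: the third sample is judged low quality and is ignored.
per_sample_losses = [2.0, 4.0, 6.0]
quality_scores = [1.0, 1.0, 0.0]
loss = weighted_supervised_loss(per_sample_losses, quality_scores)
```

With uniform scores this reduces to the ordinary mean loss, so the weighting changes training only where the discriminator distinguishes between samples.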

arXiv information

Authors	Jinliang Zheng, Jianxiong Li, Sijie Cheng, Yinan Zheng, Jiaming Li, Jihao Liu, Yu Liu, Jingjing Liu, Xianyuan Zhan
Published	2024-10-16 09:28:22+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
