Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision

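Summary

Current large multimodal models (LMMs) struggle with grounding, which requires the model to relate language components to visual entities.
In contrast to the common practice of fine-tuning LMMs with additional grounding supervision, the authors find that grounding ability can in fact emerge in LMMs trained without explicit grounding supervision.
To reveal this emerging grounding, they introduce an "attend-and-segment" method that uses attention maps from standard LMMs to perform pixel-level segmentation.
To further strengthen grounding, they propose DIFFLMM, an LMM that uses a diffusion-based visual encoder instead of the standard CLIP visual encoder and is trained with the same weak supervision.
Because it is not constrained by the biases and limited scale of grounding-specific supervision data, the approach is more generalizable and scalable.
It achieves competitive performance on both grounding-specific benchmarks and general visual question answering benchmarks, compared with grounding LMMs and generalist LMMs, respectively.
Notably, it achieves a grounding mask recall of 44.2 on grounded conversation generation without any grounding supervision, outperforming the extensively supervised model GLaMM.
Project page: https://groundLMM.github.io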

Summary (original)

Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities. Contrary to the common practice that fine-tunes LMMs with additional grounding supervision, we find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision. To reveal this emerging grounding, we introduce an ‘attend-and-segment’ method which leverages attention maps from standard LMMs to perform pixel-level segmentation. Furthermore, to enhance the grounding ability, we propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision. Without being constrained by the biases and limited scale of grounding-specific supervision data, our approach is more generalizable and scalable. We achieve competitive performance on both grounding-specific and general visual question answering benchmarks, compared with grounding LMMs and generalist LMMs, respectively. Notably, we achieve a 44.2 grounding mask recall on grounded conversation generation without any grounding supervision, outperforming the extensively supervised model GLaMM. Project page: https://groundLMM.github.io.
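
The abstract does not spell out how attention maps become segmentation masks, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes a hypothetical input: one attention weight per visual token, laid out row by row on a patch grid. It min-max normalizes the weights, upsamples them to image resolution by nearest neighbor, and thresholds them into a binary mask. The function name `attention_to_mask`, the `threshold` parameter, and the thresholding step itself are illustrative assumptions.

```python
def attention_to_mask(attn, grid_hw, image_hw, threshold=0.5):
    """Turn per-token attention weights into a coarse binary pixel mask.

    attn      : flat list of attention weights, one per visual token,
                in row-major order over the patch grid (assumed layout)
    grid_hw   : (rows, cols) of the patch grid
    image_hw  : (height, width) of the target image in pixels
    threshold : cutoff on the normalized attention (illustrative choice)

    Returns (mask, peak) where mask is a height x width list of booleans
    and peak is the (row, col) pixel at the center of the most-attended patch.
    """
    gh, gw = grid_hw
    height, width = image_hw
    if len(attn) != gh * gw:
        raise ValueError("attention length must equal rows * cols of the grid")

    # Min-max normalize so the threshold is independent of attention scale.
    lo, hi = min(attn), max(attn)
    span = hi - lo
    norm = [(a - lo) / span if span > 0 else 0.0 for a in attn]

    # Nearest-neighbor upsampling: each pixel inherits its patch's value.
    mask = [
        [norm[(y * gh // height) * gw + (x * gw // width)] >= threshold
         for x in range(width)]
        for y in range(height)
    ]

    # Center of the most-attended patch, usable as a point prompt
    # for a separate segmentation model (an assumption, not shown here).
    best = max(range(len(attn)), key=lambda i: attn[i])
    r, c = divmod(best, gw)
    peak = (int((r + 0.5) * height / gh), int((c + 0.5) * width / gw))
    return mask, peak


if __name__ == "__main__":
    mask, peak = attention_to_mask([0.0, 0.1, 0.2, 0.9], (2, 2), (4, 4))
    for row in mask:
        print("".join("#" if v else "." for v in row))
    print("peak pixel:", peak)
```

A plain threshold yields only a blocky, patch-aligned mask. In practice the peak location would more plausibly be passed to a dedicated segmentation model to obtain sharp object boundaries.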

arXiv information

Authors	Shengcao Cao, Liang-Yan Gui, Yu-Xiong Wang
Published	2024-10-10 17:59:55+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
