GLaMM: Pixel Grounding Large Multimodal Model

Summary

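Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Early LMMs took a whole image and a text prompt and generated ungrounded textual responses. More recently, region-level LMMs have been used to generate visually grounded responses. However, they can refer to only a single object category at a time, require users to specify regions in the input, or cannot offer dense pixel-wise object grounding. In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with the corresponding object segmentation masks. GLaMM not only grounds the objects that appear in a conversation but is also flexible enough to accept both text and optional visual prompts (regions of interest) as input. This lets users interact with the model at various levels of granularity in both the textual and visual domains. Because there is no standard benchmark for the novel setting of generating visually grounded, detailed conversations, we introduce a comprehensive evaluation protocol built on our curated grounded conversations. Our proposed Grounded Conversation Generation (GCG) task requires densely grounded concepts in natural scenes at large scale. To this end, we propose the densely annotated Grounding-anything Dataset (GranD), built with our automated annotation pipeline, which covers 7.5M unique concepts grounded in a total of 810M regions available with segmentation masks. Beyond GCG, GLaMM also performs effectively on several downstream tasks, including referring expression segmentation, image- and region-level captioning, and vision-language conversation. Project page: https://mbzuai-oryx.github.io/groundingLMM.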

Summary (original)

Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial efforts towards LMMs used holistic images and text prompts to generate ungrounded textual responses. Very recently, region-level LMMs have been used to generate visually grounded responses. However, they are limited to only referring to a single object category at a time, require users to specify the regions in inputs, or cannot offer dense pixel-wise object grounding. In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks. GLaMM not only grounds objects appearing in the conversations but is flexible enough to accept both textual and optional visual prompts (region of interest) as input. This empowers users to interact with the model at various levels of granularity, both in textual and visual domains. Due to the lack of standard benchmarks for the novel setting of generating visually grounded detailed conversations, we introduce a comprehensive evaluation protocol with our curated grounded conversations. Our proposed Grounded Conversation Generation (GCG) task requires densely grounded concepts in natural scenes at a large scale. To this end, we propose a densely annotated Grounding-anything Dataset (GranD) using our proposed automated annotation pipeline that encompasses 7.5M unique concepts grounded in a total of 810M regions available with segmentation masks. Besides GCG, GLaMM also performs effectively on several downstream tasks, e.g., referring expression segmentation, image and region-level captioning and vision-language conversations. Project Page: https://mbzuai-oryx.github.io/groundingLMM.

arXiv information

Authors	Hanoona Rasheed,Muhammad Maaz,Sahal Shaji,Abdelrahman Shaker,Salman Khan,Hisham Cholakkal,Rao M. Anwer,Eric Xing,Ming-Hsuan Yang,Fahad S. Khan
Published	2023-11-06 18:59:57+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL