GroundingGPT: Language Enhanced Multi-modal Grounding Model

Abstract

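Multi-modal large language models have demonstrated strong performance across a variety of tasks in different modalities.
However, existing multi-modal models focus mainly on capturing global information within each modality and neglect the importance of perceiving local information across modalities.
As a result, these models cannot effectively understand the fine-grained details of the input data, which limits their performance on tasks that require a more nuanced understanding.
To address this limitation, there is a pressing need for models that enable fine-grained understanding across multiple modalities and are therefore applicable to a wider range of tasks.
In this paper, we propose GroundingGPT, a language enhanced multi-modal grounding model.
Beyond capturing global information as other multi-modal models do, our proposed model excels at tasks that demand a detailed understanding of local information within the input.
It precisely identifies and localizes specific regions in images or moments in videos.
To achieve this, we design a diversified dataset construction pipeline, which yields a multi-modal, multi-granularity dataset for model training.
The code, dataset, and demo of our model are available at https://github.com/lzw-lzw/GroundingGPT.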

Abstract (original)

Multi-modal large language models have demonstrated impressive performance across various tasks in different modalities. However, existing multi-modal models primarily emphasize capturing global information within each modality while neglecting the importance of perceiving local information across modalities. Consequently, these models lack the ability to effectively understand the fine-grained details of input data, limiting their performance in tasks that require a more nuanced understanding. To address this limitation, there is a compelling need to develop models that enable fine-grained understanding across multiple modalities, thereby enhancing their applicability to a wide range of tasks. In this paper, we propose GroundingGPT, a language enhanced multi-modal grounding model. Beyond capturing global information like other multi-modal models, our proposed model excels at tasks demanding a detailed understanding of local information within the input. It demonstrates precise identification and localization of specific regions in images or moments in videos. To achieve this objective, we design a diversified dataset construction pipeline, resulting in a multi-modal, multi-granularity dataset for model training. The code, dataset, and demo of our model can be found at https://github.com/lzw-lzw/GroundingGPT.

arXiv information

Authors	Zhaowei Li, Qi Xu, Dong Zhang, Hang Song, Yiqing Cai, Qi Qi, Ran Zhou, Junting Pan, Zefeng Li, Van Tu Vu, Zhida Huang, Tao Wang
Published	2024-03-05 14:36:12+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
