M^2ConceptBase: A Fine-Grained Aligned Concept-Centric Multimodal Knowledge Base

Summary

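Multimodal knowledge bases (MMKBs) provide cross-modally aligned knowledge that is essential for multimodal tasks.
However, the images in existing MMKBs are usually collected for entities in encyclopedic knowledge graphs.
As a result, they lack detailed grounding of visual semantics in linguistic concepts, which is essential for the visual concept cognition ability of multimodal models.
To address this gap, the authors introduce M^2ConceptBase, the first concept-centric MMKB.
M^2ConceptBase models each concept as a node with associated images and a detailed textual description.
The authors propose a context-aware multimodal symbol grounding approach that aligns concept-image and concept-description pairs using context information from image-text datasets.
Comprising 951K images and 152K concepts, M^2ConceptBase links each concept to an average of 6.27 images and a single description, ensuring comprehensive visual and textual semantics.
Human studies confirm an alignment accuracy above 95%, underscoring its quality.
In addition, the experiments show that M^2ConceptBase significantly improves VQA model performance on the OK-VQA task.
Through retrieval augmentation, M^2ConceptBase also substantially improves the fine-grained concept understanding of multimodal large language models on two concept-related tasks, highlighting its value.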

Summary (original)

Multimodal knowledge bases (MMKBs) provide cross-modal aligned knowledge crucial for multimodal tasks. However, the images in existing MMKBs are generally collected for entities in encyclopedia knowledge graphs. Therefore, detailed groundings of visual semantics with linguistic concepts are lacking, which are essential for the visual concept cognition ability of multimodal models. Addressing this gap, we introduce M^2ConceptBase, the first concept-centric MMKB. M^2ConceptBase models concepts as nodes with associated images and detailed textual descriptions. We propose a context-aware multimodal symbol grounding approach to align concept-image and concept-description pairs using context information from image-text datasets. Comprising 951K images and 152K concepts, M^2ConceptBase links each concept to an average of 6.27 images and a single description, ensuring comprehensive visual and textual semantics. Human studies confirm more than 95% alignment accuracy, underscoring its quality. Additionally, our experiments demonstrate that M^2ConceptBase significantly enhances VQA model performance on the OK-VQA task. M^2ConceptBase also substantially improves the fine-grained concept understanding capabilities of multimodal large language models through retrieval augmentation in two concept-related tasks, highlighting its value.
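
The abstract describes concepts as nodes that each carry several images and a single description, and mentions retrieval augmentation for concept-related tasks. The following is a minimal sketch of that idea, not the authors' implementation: the names `ConceptNode`, `ConceptBase`, and `augment_prompt`, the word-matching lookup, and the prompt layout are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptNode:
    """Hypothetical node: one concept, one description, several images."""
    name: str
    description: str
    image_paths: list[str] = field(default_factory=list)


class ConceptBase:
    """Toy in-memory store keyed by lower-cased concept name (an assumption)."""

    def __init__(self) -> None:
        self._nodes: dict[str, ConceptNode] = {}

    def add(self, node: ConceptNode) -> None:
        self._nodes[node.name.lower()] = node

    def lookup(self, question: str) -> list[ConceptNode]:
        # Naive retrieval: return concepts whose name appears as a whole
        # word in the question. A real system would use a learned retriever.
        words = {w.strip(".,?!").lower() for w in question.split()}
        return [node for key, node in self._nodes.items() if key in words]

    def mean_images_per_concept(self) -> float:
        # Same kind of statistic as the "average images per concept"
        # figure quoted in the abstract, computed over this toy store.
        if not self._nodes:
            return 0.0
        total = sum(len(node.image_paths) for node in self._nodes.values())
        return total / len(self._nodes)


def augment_prompt(kb: ConceptBase, question: str) -> str:
    """Prepend retrieved concept descriptions to the question text."""
    lines = [f"{node.name}: {node.description}" for node in kb.lookup(question)]
    if not lines:
        return question
    context = "\n".join(lines)
    return f"{context}\n\nQuestion: {question}"


kb = ConceptBase()
kb.add(ConceptNode("zebra", "A striped African equid.", ["z1.jpg", "z2.jpg"]))
kb.add(ConceptNode("kite", "A light frame flown on a string.", ["k1.jpg"]))
print(augment_prompt(kb, "What is the zebra eating?"))
```

In this sketch the retrieved descriptions are passed to the model as plain text ahead of the question; how M^2ConceptBase actually feeds images and descriptions to a multimodal model is not stated in the abstract.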

arXiv information

Authors	Zhiwei Zha, Jiaan Wang, Zhixu Li, Xiangru Zhu, Wei Song, Yanghua Xiao
Published	2024-08-01 08:03:48+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
