Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models

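Summary

Multi-modal large language models (MLLMs) have achieved remarkable success in fine-grained visual understanding across a range of tasks.
However, they often struggle because fine-grained knowledge is insufficiently aligned, which limits their ability to capture local details accurately and to attain comprehensive global perception.
Recent work has focused on aligning object expressions with grounding information, but it typically lacks explicit integration of object images, which carry rich information beyond text or coordinates alone.
To bridge this gap, the authors introduce a new fine-grained visual knowledge alignment method that aligns and integrates multi-scale knowledge of objects, including texts, coordinates, and images.
The method is supported by a multi-scale fine-grained enhancement data synthesis pipeline, which provides over 300K training samples to strengthen alignment and improve overall performance.
The authors also present TinyGroundingGPT, a series of compact models optimized for high-level alignment.
At a scale of approximately 3B parameters, TinyGroundingGPT achieves outstanding results on grounding tasks while delivering performance comparable to larger MLLMs in complex visual scenarios.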

Abstract (original)

Multi-modal large language models (MLLMs) have achieved remarkable success in fine-grained visual understanding across a range of tasks. However, they often encounter significant challenges due to inadequate alignment for fine-grained knowledge, which restricts their ability to accurately capture local details and attain a comprehensive global perception. While recent advancements have focused on aligning object expressions with grounding information, they typically lack explicit integration of object images, which contain affluent information beyond mere texts or coordinates. To bridge this gap, we introduce a novel fine-grained visual knowledge alignment method that effectively aligns and integrates multi-scale knowledge of objects, including texts, coordinates, and images. This innovative method is underpinned by our multi-scale fine-grained enhancement data synthesis pipeline, which provides over 300K essential training data to enhance alignment and improve overall performance. Furthermore, we present TinyGroundingGPT, a series of compact models optimized for high-level alignments. With a scale of approximately 3B parameters, TinyGroundingGPT achieves outstanding results in grounding tasks while delivering performance comparable to larger MLLMs in complex visual scenarios.

arXiv information

Authors	Wei Wang, Zhaowei Li, Qi Xu, Linfeng Li, YiQing Cai, Botian Jiang, Hang Song, Xingcan Hu, Pengyu Wang, Li Xiao
Published	2024-11-14 18:57:07+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
