URECA: Unique Region Caption Anything

Abstract

Region-level captioning aims to generate natural language descriptions for specific image regions while highlighting their distinguishing features.
However, existing methods struggle to produce unique captions across multiple levels of granularity, limiting their real-world applicability.
To address the need for detailed region-level understanding, we introduce the URECA dataset, a large-scale dataset tailored for multi-granularity region captioning.
Unlike prior datasets that focus primarily on salient objects, the URECA dataset ensures a unique and consistent mapping between regions and captions by incorporating a diverse set of objects, parts, and background elements.
Central to this is a stage-wise data curation pipeline, where each stage incrementally refines region selection and caption generation.
By leveraging Multimodal Large Language Models (MLLMs) at each stage, our pipeline produces distinctive and contextually grounded captions with improved accuracy and semantic diversity.
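The abstract does not spell out the pipeline's stages, but the idea of chaining MLLM calls can be illustrated concretely. The sketch below is a minimal, hypothetical rendering: `mllm` stands in for any multimodal LLM backend, and the three stages (region filtering, draft captioning, uniqueness refinement) and their prompts are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of a stage-wise curation pipeline in the spirit of the
# abstract. `mllm` stands in for any multimodal LLM backend; the stages
# and prompts below are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, Optional

# An MLLM backend: (prompt, image_path, mask_path) -> text answer.
MLLM = Callable[[str, str, str], str]

@dataclass
class Region:
    image_path: str
    mask_path: str  # binary mask locating the region in the image

def curate_caption(region: Region, mllm: MLLM) -> Optional[str]:
    # Stage 1: keep only well-formed regions (objects, parts, background).
    verdict = mllm(
        "Is the masked region a coherent object, object part, or "
        "background element? Answer yes or no.",
        region.image_path, region.mask_path)
    if not verdict.strip().lower().startswith("yes"):
        return None
    # Stage 2: draft a caption grounded in the masked region.
    draft = mllm("Describe the masked region in one sentence.",
                 region.image_path, region.mask_path)
    # Stage 3: refine the draft so it uniquely identifies this region
    # and no other region in the image.
    return mllm(
        "Rewrite this caption so it matches the masked region and no "
        f"other region in the image: {draft}",
        region.image_path, region.mask_path)
```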
Building upon this dataset, we present URECA, a novel captioning model designed to effectively encode multi-granularity regions.
URECA maintains essential spatial properties such as position and shape through simple yet impactful modifications to existing MLLMs, enabling fine-grained and semantically rich region descriptions.
Our approach introduces dynamic mask modeling and a high-resolution mask encoder to enhance caption uniqueness.
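As a rough illustration of how a mask encoder can preserve position and shape, the PyTorch sketch below encodes the binary region mask at high resolution into a grid of tokens rather than cropping the region out. The class name `MaskEncoder`, the layer sizes, and the token dimensions are assumptions for illustration, not the paper's architecture; dynamic mask modeling, a training-time component, is not sketched here.

```python
# Sketch: encoding a high-resolution binary mask into spatial tokens.
# Unlike cropping, which discards where the region sits in the image,
# strided convs keep a spatial grid, so position and shape survive.
# Layer sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class MaskEncoder(nn.Module):
    def __init__(self, dim: int = 256, patch: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, dim // 4, kernel_size=patch // 4, stride=patch // 4),
            nn.GELU(),
            nn.Conv2d(dim // 4, dim, kernel_size=4, stride=4),
        )

    def forward(self, mask: torch.Tensor) -> torch.Tensor:
        # mask: (B, 1, H, W) binary region mask at input resolution
        feats = self.net(mask)                    # (B, dim, H/patch, W/patch)
        return feats.flatten(2).transpose(1, 2)   # (B, num_tokens, dim)

# Example: a 1024x1024 mask becomes a short token sequence that could be
# fed to an MLLM alongside its visual tokens.
enc = MaskEncoder()
tokens = enc(torch.zeros(1, 1, 1024, 1024))
print(tokens.shape)  # torch.Size([1, 4096, 256])
```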
Experiments show that URECA achieves state-of-the-art performance on the URECA dataset and generalizes well to existing region-level captioning benchmarks.


arXiv Information

Authors: Sangbeom Lim, Junwan Kim, Heeji Yoon, Jaewoo Jung, Seungryong Kim
Published: 2025-04-07 17:59:44+00:00
arXiv site: arxiv_id (pdf)

Provider, services used

arxiv.jp, Google

Categories: cs.AI, cs.CV