Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning

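Abstract

We introduce Pix2Cap-COCO, the first panoptic pixel-level captioning dataset designed to advance fine-grained visual understanding.
To build it, we carefully design an automated annotation pipeline that prompts GPT-4V to generate pixel-aligned, instance-specific captions for the individual objects in an image, so that models can learn finer-grained relationships between objects and their contexts.
The pipeline yields 167,254 detailed captions, averaging 22.94 words per caption.
Building on Pix2Cap-COCO, we introduce a new task, panoptic segmentation-captioning, which requires a model to recognize the instances in an image and provide a detailed description of each one simultaneously.
To benchmark this task, we design a robust baseline based on X-Decoder.
Experimental results show that Pix2Cap-COCO is a particularly challenging dataset, because it requires models to excel at both fine-grained visual understanding and detailed language generation.
We also use Pix2Cap-COCO for supervised fine-tuning (SFT) of large multimodal models (LMMs) to improve their performance.
For example, training with Pix2Cap-COCO significantly improves GPT4RoI, yielding gains of +1.4% CIDEr, +0.4% ROUGE, and +0.5% SPICE on the Visual Genome dataset.
It also strengthens GPT4RoI's region understanding on ViP-BENCH, with an overall improvement of +5.1%, including notable gains of +11.2% in recognition accuracy and +22.2% in language generation quality.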

Abstract (original)

We present Pix2Cap-COCO, the first panoptic pixel-level caption dataset designed to advance fine-grained visual understanding. To achieve this, we carefully design an automated annotation pipeline that prompts GPT-4V to generate pixel-aligned, instance-specific captions for individual objects within images, enabling models to learn more granular relationships between objects and their contexts. This approach results in 167,254 detailed captions, with an average of 22.94 words per caption. Building on Pix2Cap-COCO, we introduce a novel task, panoptic segmentation-captioning, which challenges models to recognize instances in an image and provide detailed descriptions for each simultaneously. To benchmark this task, we design a robust baseline based on X-Decoder. The experimental results demonstrate that Pix2Cap-COCO is a particularly challenging dataset, as it requires models to excel in both fine-grained visual understanding and detailed language generation. Furthermore, we leverage Pix2Cap-COCO for Supervised Fine-Tuning (SFT) on large multimodal models (LMMs) to enhance their performance. For example, training with Pix2Cap-COCO significantly improves the performance of GPT4RoI, yielding gains in CIDEr +1.4%, ROUGE +0.4%, and SPICE +0.5% on Visual Genome dataset, and strengthens its region understanding ability on the ViP-BENCH, with an overall improvement of +5.1%, including notable increases in recognition accuracy +11.2% and language generation quality +22.2%.

arXiv information

Authors	Zuyao You, Junke Wang, Lingyu Kong, Bo He, Zuxuan Wu
Published	2025-01-23 18:08:57+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

