ComiCap: A VLMs pipeline for dense captioning of Comic Panels

Summary

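The comics domain is advancing rapidly, driven by the development of models for single-page and multi-page analysis and synthesis.
Recent benchmarks and datasets have been introduced to support and evaluate model capabilities in tasks such as detection (panels, characters, text), linking (character re-identification and speaker identification), and analysis of comic elements (for example, dialog transcription).
However, to understand a storyline comprehensively, a model must not only extract elements but also understand their relationships and generate highly informative captions.
This work proposes a pipeline that leverages Vision-Language Models (VLMs) to obtain dense, grounded captions.
To build the pipeline, we introduce an attribute-retaining metric that assesses whether all important attributes are identified in the caption.
We also created a densely annotated test set to evaluate open-source VLMs fairly and to select the best captioning model according to this metric.
Without any additional training, our pipeline generates dense captions with bounding boxes that are quantitatively and qualitatively superior to those produced by specifically trained models.
Using this pipeline, we annotated over 2 million panels across 13,000 books, which will be made available on the project page https://github.com/emanuelevivoli/ComiCap.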
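
The summary describes the attribute-retaining metric only at a high level. As a rough illustration of the idea, the sketch below scores a caption by the fraction of annotated attributes it mentions. The function name `attribute_retention`, the exact-token matching rule, and the sample caption are all illustrative assumptions, not the authors' definition of the metric.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def attribute_retention(caption: str, attributes: list[str]) -> float:
    """Return the fraction of reference attributes found in the caption.

    Assumption: an attribute counts as retained when every one of its
    tokens appears somewhere in the caption (exact token match).
    """
    if not attributes:
        return 1.0  # nothing to retain, so nothing was missed
    caption_tokens = set(normalize(caption))
    retained = sum(
        1 for attr in attributes
        if set(normalize(attr)) <= caption_tokens
    )
    return retained / len(attributes)


# Hypothetical panel annotation, for illustration only.
caption = "A man in a red cape stands on a rooftop at night."
attributes = ["red cape", "rooftop", "night", "mask"]
score = attribute_retention(caption, attributes)
print(score)
```

Exact token matching is the simplest possible choice and misses paraphrases such as "crimson cloak" for "red cape". A softer matcher, for example one based on embedding similarity, could be substituted without changing the overall shape of the score.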

Summary (original)

The comic domain is rapidly advancing with the development of single- and multi-page analysis and synthesis models. Recent benchmarks and datasets have been introduced to support and assess models’ capabilities in tasks such as detection (panels, characters, text), linking (character re-identification and speaker identification), and analysis of comic elements (e.g., dialog transcription). However, to provide a comprehensive understanding of the storyline, a model must not only extract elements but also understand their relationships and generate highly informative captions. In this work, we propose a pipeline that leverages Vision-Language Models (VLMs) to obtain dense, grounded captions. To construct our pipeline, we introduce an attribute-retaining metric that assesses whether all important attributes are identified in the caption. Additionally, we created a densely annotated test set to fairly evaluate open-source VLMs and select the best captioning model according to our metric. Our pipeline generates dense captions with bounding boxes that are quantitatively and qualitatively superior to those produced by specifically trained models, without requiring any additional training. Using this pipeline, we annotated over 2 million panels across 13,000 books, which will be available on the project page https://github.com/emanuelevivoli/ComiCap.

arXiv information

Authors	Emanuele Vivoli, Niccolò Biondi, Marco Bertini, Dimosthenis Karatzas
Published	2024-09-24 14:59:58+00:00

Source and services used

arxiv.jp, Google
