Image Embedding Sampling Method for Diverse Captioning

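Abstract

Image captioning with state-of-the-art VLMs has improved significantly over time.
However, this comes at the cost of increased computational complexity, which makes these models less accessible for resource-constrained applications such as mobile devices and assistive technologies.
Smaller VLMs, by contrast, prioritize high-level scene descriptions and overlook the finer details that contribute to a richer understanding of an image.
In this paper, we introduce a training-free framework that enhances caption diversity and informativeness by explicitly attending to distinct image regions, using a comparably small VLM, BLIP, as the backbone.
Our approach leverages structured segmentation to produce hierarchical representations that capture both global and localized semantics.
Without requiring additional model training, we demonstrate that our method allows smaller VLMs to achieve performance comparable to larger models in terms of image-caption alignment, semantic integrity, and diversity.
We evaluate our framework on the MSCOCO, Flickr30k, and Nocaps test datasets, achieving Div-2 scores of 0.735, 0.750, and 0.748 respectively, while maintaining strong image-caption relevancy and semantic integrity with the human-annotated captions.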

Abstract (original)

Image Captioning for state-of-the-art VLMs has significantly improved over time; however, this comes at the cost of increased computational complexity, making them less accessible for resource-constrained applications such as mobile devices and assistive technologies. Alternatively, smaller VLMs prioritize high-level scene descriptions, overlooking finer details that contribute to a richer understanding of an image. In this paper, we introduce a training-free framework that enhances caption diversity and informativeness by explicitly attending to distinct image regions using a comparably small VLM, BLIP, as the backbone. Our approach leverages structured segmentation to produce hierarchical representations that capture both global and localized semantics. Without requiring additional model training, we demonstrate that our method allows smaller VLMs to achieve performance comparable to larger models in terms of image-caption alignment, semantic integrity, and diversity. We evaluate our framework on MSCOCO, Flickr30k, and Nocaps test datasets, achieving a Div-2 score of 0.735, 0.750, and 0.748 for each dataset respectively, while maintaining strong image-caption relevancy and semantic integrity with the human-annotated captions.
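The abstract reports Div-2 scores but does not define the metric. As a rough illustration, the sketch below uses one definition common in the diverse-captioning literature: the number of distinct bigrams in the set of captions generated for an image, divided by the total number of words in that set. This is an assumption and may not match the paper's exact computation; the function names `ngrams` and `div_n`, and the lowercase whitespace tokenization, are hypothetical choices made for the example.

```python
from typing import Iterable, List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def div_n(captions: Iterable[str], n: int = 2) -> float:
    """Distinct n-grams divided by total words over one caption set.

    Assumption: tokens are lowercased and split on whitespace; the
    paper may use a different tokenizer or normalization.
    """
    distinct = set()
    total_words = 0
    for caption in captions:
        tokens = caption.lower().split()
        total_words += len(tokens)
        distinct.update(ngrams(tokens, n))
    # An empty caption set has no diversity to measure.
    return len(distinct) / total_words if total_words else 0.0


if __name__ == "__main__":
    captions = ["a dog runs on grass", "a dog sits on grass"]
    print(f"Div-2 = {div_n(captions):.3f}")
```

Under this definition, a set of identical captions scores low because repeated bigrams are counted once while every word still contributes to the denominator. A dataset-level score would then typically be the average over images.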

arXiv information

Authors	Sania Waheed, Na Min An
Published	2025-02-14 12:33:19+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
