ViTOC: Vision Transformer and Object-aware Captioner

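Abstract

This paper introduces ViTOC (Vision Transformer and Object-aware Captioner), a new vision-language model for image captioning that addresses the accuracy and diversity of generated descriptions.
Unlike conventional approaches, ViTOC uses a dual-path architecture built on a Vision Transformer and an object detector, and fuses global visual features with local object information through learnable vectors.
The model introduces an object-aware prompting strategy that significantly improves its handling of long-tail data.
Experiments on the standard COCO dataset show that ViTOC outperforms baseline models on all evaluation metrics.
To further validate the model's effectiveness, the authors also propose a reference-free evaluation method based on CLIP.
By reusing pretrained visual model parameters, ViTOC achieves efficient end-to-end training.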

Abstract (original)

This paper presents ViTOC (Vision Transformer and Object-aware Captioner), a novel vision-language model for image captioning that addresses the challenges of accuracy and diversity in generated descriptions. Unlike conventional approaches, ViTOC employs a dual-path architecture based on Vision Transformer and object detector, effectively fusing global visual features and local object information through learnable vectors. The model introduces an innovative object-aware prompting strategy that significantly enhances its capability in handling long-tail data. Experiments on the standard COCO dataset demonstrate that ViTOC outperforms baseline models across all evaluation metrics. Additionally, we propose a reference-free evaluation method based on CLIP to further validate the model’s effectiveness. By utilizing pretrained visual model parameters, ViTOC achieves efficient end-to-end training.
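
The abstract does not say how the CLIP-based reference-free evaluation is computed. As a rough illustration only, one common way to score a caption without reference captions is the cosine similarity between the CLIP image embedding and the CLIP text embedding of the generated caption. The sketch below assumes the embeddings have already been computed elsewhere and are passed in as plain lists of floats; the function names `reference_free_score` and `mean_score` are hypothetical and do not come from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def reference_free_score(image_embedding, caption_embedding):
    """Hypothetical per-image score: cosine similarity clipped at zero.

    Assumption: both embeddings come from the same CLIP model, so they
    live in a shared space. No reference captions are involved.
    """
    return max(cosine_similarity(image_embedding, caption_embedding), 0.0)


def mean_score(pairs):
    """Average the per-image score over (image_embedding, caption_embedding) pairs."""
    if not pairs:
        raise ValueError("need at least one (image, caption) pair")
    return sum(reference_free_score(img, cap) for img, cap in pairs) / len(pairs)
```

A higher mean indicates captions that sit closer to their images in the embedding space. Whether ViTOC's metric uses clipping, rescaling, or a different aggregation is not stated in the abstract, so those choices here are assumptions.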

arXiv information

Author	Feiyang Huang
Published	2024-11-27 15:45:23+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
