HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning

Abstract

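Image captioning has advanced considerably, driven by research into how to encode images using pre-trained models.
These encodings include visual ones (e.g. image grid features or detected objects) and, more recently, textual ones (e.g. image tags or text descriptions of image regions).
As more advanced encodings become available and are incorporated, a natural question arises: how can a heterogeneous set of encodings be leveraged efficiently and effectively?
In this paper, we propose treating the encodings as augmented views of the input image.
The image captioning model encodes each view independently and efficiently with a shared encoder, and incorporates a contrastive loss across the encoded views in a novel way to improve representation quality and the model's data efficiency.
Our proposed hierarchical decoder then adaptively weighs the encoded views according to their effectiveness for caption generation, first aggregating within each view at the token level and then across views at the view level.
We demonstrate significant performance improvements over the state of the art, +5.6% CIDEr on MS-COCO and +12.9% CIDEr on Flickr30k, and conduct rigorous analyses to show the importance of each part of the design.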
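
The two-level aggregation described above (token level within each view, then view level across views) can be illustrated with a minimal sketch. This is not the authors' decoder: it assumes plain scaled dot-product attention with a single query vector, and the function name `hierarchical_aggregate` and all shapes are hypothetical, chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def hierarchical_aggregate(query, views):
    """Aggregate encoded views in two stages (illustrative assumption only).

    query: (d,) decoder state for the current caption token.
    views: list of (n_i, d) arrays, one per encoded view; n_i may differ.
    Returns the (d,) context vector and the (V,) view-level weights.
    """
    scale = np.sqrt(query.shape[0])
    summaries = []
    for tokens in views:
        # Token level: attend over the tokens inside a single view.
        token_weights = softmax(tokens @ query / scale)
        summaries.append(token_weights @ tokens)
    summaries = np.stack(summaries)  # (V, d), one summary per view

    # View level: weigh the per-view summaries against each other.
    view_weights = softmax(summaries @ query / scale)
    context = view_weights @ summaries
    return context, view_weights


rng = np.random.default_rng(0)
d = 8
# Three hypothetical views with different token counts.
views = [rng.normal(size=(n, d)) for n in (5, 3, 7)]
context, view_weights = hierarchical_aggregate(rng.normal(size=d), views)
```

The view-level weights sum to one, so a view that matches the current decoding step poorly is down-weighted rather than discarded.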

Abstract (original)

A great deal of progress has been made in image captioning, driven by research into how to encode the image using pre-trained models. This includes visual encodings (e.g. image grid features or detected objects) and more recently textual encodings (e.g. image tags or text descriptions of image regions). As more advanced encodings are available and incorporated, it is natural to ask: how to efficiently and effectively leverage the heterogeneous set of encodings? In this paper, we propose to regard the encodings as augmented views of the input image. The image captioning model encodes each view independently with a shared encoder efficiently, and a contrastive loss is incorporated across the encoded views in a novel way to improve their representation quality and the model’s data efficiency. Our proposed hierarchical decoder then adaptively weighs the encoded views according to their effectiveness for caption generation by first aggregating within each view at the token level, and then across views at the view level. We demonstrate significant performance improvements of +5.6% CIDEr on MS-COCO and +12.9% CIDEr on Flickr30k compared to state of the arts, and conduct rigorous analyses to demonstrate the importance of each part of our design.
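
The abstract does not specify the form of the contrastive loss applied across encoded views. As a rough illustration of the general idea, the sketch below uses a generic InfoNCE-style loss between pooled encodings of two views of the same batch of images. The function name `info_nce`, the pooling into one vector per view, and the temperature value are all assumptions, not details taken from the paper.

```python
import numpy as np


def info_nce(za, zb, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's exact loss).

    za, zb: (B, d) pooled encodings of two views of the same B images.
    Row i of za and row i of zb form a positive pair; all other rows
    in the batch act as negatives.
    """
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature  # (B, B) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal.
    return -np.mean(np.diag(log_prob))


rng = np.random.default_rng(0)
za = rng.normal(size=(8, 16))
zb = za + 0.01 * rng.normal(size=(8, 16))  # second view of the same images
aligned_loss = info_nce(za, zb)
shuffled_loss = info_nce(za, np.roll(zb, 1, axis=0))  # mismatched pairs
```

When the two views of each image are encoded consistently, the aligned loss is lower than the loss on mismatched pairs, which is the signal such an objective uses to pull views of the same image together.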

arXiv information

Authors	Chia-Wen Kuo, Zsolt Kira
Published	2023-05-25 17:50:17+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
