Visually-Aware Context Modeling for News Image Captioning

Summary

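News image captioning aims to generate a caption for an image based on the content of both a news article and the image.
To use the visual information effectively, it is important to exploit the connection between the article/caption context and the image.
Psychological studies indicate that human faces in images receive higher attention priority.
In addition, people often play a central role in news stories, as also shown by the face-name co-occurrence pattern found in existing news image captioning datasets.
The authors therefore design a face-naming module that links faces in images to names in captions/articles in order to learn better name embeddings.
Apart from names, which can be linked directly to an image region (a face), news image captions mostly contain context information that can only be found in the article.
Humans typically handle this by searching the article for relevant information based on the image.
To emulate this thought process, the authors design a retrieval strategy that uses CLIP to retrieve sentences that are semantically close to the image.
They conduct extensive experiments to demonstrate the effectiveness of the framework.
Without using additional paired data, the framework sets a new state of the art on two news image captioning datasets, exceeding the previous state of the art by 5 CIDEr points.
The code will be released upon acceptance.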

Summary (original)

The goal of News Image Captioning is to generate an image caption according to the content of both a news article and an image. To leverage the visual information effectively, it is important to exploit the connection between the context in the articles/captions and the images. Psychological studies indicate that human faces in images draw higher attention priorities. On top of that, humans often play a central role in news stories, as also proven by the face-name co-occurrence pattern we discover in existing News Image Captioning datasets. Therefore, we design a face-naming module for faces in images and names in captions/articles to learn a better name embedding. Apart from names, which can be directly linked to an image area (faces), news image captions mostly contain context information that can only be found in the article. Humans typically address this by searching for relevant information from the article based on the image. To emulate this thought process, we design a retrieval strategy using CLIP to retrieve sentences that are semantically close to the image. We conduct extensive experiments to demonstrate the efficacy of our framework. Without using additional paired data, we establish the new state-of-the-art performance on two News Image Captioning datasets, exceeding the previous state-of-the-art by 5 CIDEr points. We will release code upon acceptance.
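The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that the image and every article sentence have already been embedded into a shared space (for example by CLIP's image and text encoders) and that closeness is measured by cosine similarity. The function names, the toy vectors, and the cutoff `k` are hypothetical and chosen only for illustration; the abstract does not specify how the authors score or select sentences.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)


def retrieve_sentences(image_emb, sentences, sentence_embs, k=2):
    """Return up to k (sentence, score) pairs, most similar to the image first.

    Assumption: image_emb and sentence_embs come from encoders that share
    one embedding space, such as CLIP's image and text encoders.
    """
    scored = [
        (sentence, cosine(image_emb, emb))
        for sentence, emb in zip(sentences, sentence_embs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for real embeddings (hypothetical).
image_emb = [0.9, 0.1, 0.0]
sentences = [
    "The mayor spoke at the rally.",
    "Stocks fell on Tuesday.",
    "Crowds gathered downtown.",
]
sentence_embs = [
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.6, 0.4, 0.1],
]

top = retrieve_sentences(image_emb, sentences, sentence_embs, k=2)
for sentence, score in top:
    print(f"{score:.3f}  {sentence}")
```

In this toy setup the two sentences whose vectors point in roughly the same direction as the image vector are kept, and the unrelated one is dropped. In a real pipeline the retrieved sentences would then be passed to the captioning model as context.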

arXiv information

Authors	Tingyu Qu, Tinne Tuytelaars, Marie-Francine Moens
Published	2023-08-16 12:39:39+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
