‘Let’s not Quote out of Context’: Unified Vision-Language Pretraining for Context Assisted Image Captioning

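Summary

Well-formed, context-aware image captions and tags in enterprise content such as marketing material are critical for brand presence and content recall.
Creating and updating them manually is non-trivial given the scale and tedium of the task.
We propose a new unified vision-language (VL) model based on the One For All (OFA) model, focused on context-assisted image captioning, in which the caption is generated from both the image and its context.
Our approach aims to overcome the context-independent nature of existing approaches, which treat image and text independently.
We exploit context by pretraining the model on datasets for three tasks: news image captioning, where the news article is the context; contextual visual entailment; and keyword extraction from the context.
The second pretraining task is a new VL task, and we construct and release two datasets for it, with 1.1M and 2.2K data instances.
Our system achieves state-of-the-art results, improving the CIDEr score by up to 8.34 on the benchmark news image captioning datasets.
To the best of our knowledge, this is the first effort to incorporate contextual information when pretraining models for VL tasks.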
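
As a rough illustration of how three pretraining tasks could share one text-in, text-out format, the sketch below builds one training example per task. It is not the authors' implementation: the prompt wording, the `Example` class, the `truncate` helper, the 100-word limit, the yes/no framing of contextual visual entailment, and the choice to make keyword extraction text-only are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    """One sequence-to-sequence training instance (hypothetical structure)."""
    source: str                       # text given to the model together with the image
    target: str                       # text the model should generate
    image_path: Optional[str] = None  # None when the task uses text only


def truncate(context: str, max_words: int = 100) -> str:
    """Keep the first max_words words of the context (assumed input budget)."""
    return " ".join(context.split()[:max_words])


def captioning_example(image_path: str, article: str, caption: str) -> Example:
    # News image captioning: the news article is the context.
    source = f"context: {truncate(article)} question: what does the image describe?"
    return Example(source, caption, image_path)


def entailment_example(image_path: str, article: str, caption: str,
                       consistent: bool) -> Example:
    # Contextual visual entailment, framed here (as an assumption) as a yes/no decision.
    source = (f"context: {truncate(article)} caption: {caption} "
              "question: is the caption consistent with the image and context?")
    return Example(source, "yes" if consistent else "no", image_path)


def keyword_example(article: str, keywords: List[str]) -> Example:
    # Keyword extraction from the context; assumed to need no image.
    source = f"context: {truncate(article)} question: what are the keywords?"
    return Example(source, ", ".join(keywords))


if __name__ == "__main__":
    article = "The mayor opened the new river bridge on Monday after two years of work."
    for ex in (
        captioning_example("bridge.jpg", article, "The mayor at the bridge opening."),
        entailment_example("bridge.jpg", article, "A football match.", False),
        keyword_example(article, ["mayor", "bridge"]),
    ):
        print(ex)
```

Casting every task as the same source and target pair is what lets a single model be trained on all three without task-specific output heads.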

Summary (original)

Well-formed context aware image captions and tags in enterprise content such as marketing material are critical to ensure their brand presence and content recall. Manual creation and updates to ensure the same is non trivial given the scale and the tedium towards this task. We propose a new unified Vision-Language (VL) model based on the One For All (OFA) model, with a focus on context-assisted image captioning where the caption is generated based on both the image and its context. Our approach aims to overcome the context-independent (image and text are treated independently) nature of the existing approaches. We exploit context by pretraining our model with datasets of three tasks: news image captioning where the news article is the context, contextual visual entailment, and keyword extraction from the context. The second pretraining task is a new VL task, and we construct and release two datasets for the task with 1.1M and 2.2K data instances. Our system achieves state-of-the-art results with an improvement of up to 8.34 CIDEr score on the benchmark news image captioning datasets. To the best of our knowledge, ours is the first effort at incorporating contextual information in pretraining the models for the VL tasks.

arXiv information

Authors	Abisek Rajakumar Kalarani, Pushpak Bhattacharyya, Niyati Chhaya, Sumit Shekhar
Published	2023-06-01 17:34:25+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
