Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets

Summary

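This paper addresses the task of generating fluent descriptions by training on a heterogeneous mix of data sources that includes both human-annotated and web-collected captions.
Large-scale datasets of noisy image-text pairs are a suboptimal source of supervision because of their low-quality descriptive style, whereas human-annotated datasets are cleaner but smaller in scale.
To get the best of both, we propose to leverage and separate semantics and descriptive style by incorporating a style token and keywords extracted through a retrieval component.
The proposed model avoids the need for object detectors, is trained with a single prompt language modeling objective, and can replicate the style of human-collected captions while training on sources with different input styles.
Experimentally, the model shows a strong capability for recognizing real-world concepts and producing high-quality captions.
Extensive experiments on several image captioning datasets, including CC3M, nocaps, and the competitive COCO dataset, show that our model consistently outperforms baselines and state-of-the-art approaches.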
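
To make the idea of conditioning on a style token plus retrieved keywords more concrete, here is a minimal sketch of how such a prompt might be assembled. It is not the authors' implementation: the token names `<human>` and `<web>`, the helper functions, the cosine-similarity retrieval, and the toy two-dimensional embeddings are all assumptions made for illustration. In a real system the embeddings would presumably come from a pretrained image-text encoder.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical style tokens, one per caption source (names are assumptions).
STYLE_TOKENS = {"human": "<human>", "web": "<web>"}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_keywords(
    image_emb: Sequence[float],
    keyword_embs: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Return the k keywords whose embeddings are closest to the image embedding."""
    ranked = sorted(
        keyword_embs,
        key=lambda word: cosine(image_emb, keyword_embs[word]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(keywords: Sequence[str], style: str) -> str:
    """Prefix the retrieved keywords with the token for the requested style."""
    return " ".join([STYLE_TOKENS[style], *keywords])


# Toy example: three candidate keywords and one image embedding (made-up values).
vocab = {"dog": [1.0, 0.0], "beach": [0.8, 0.6], "car": [0.0, 1.0]}
image = [1.0, 0.1]
prompt = build_prompt(retrieve_keywords(image, vocab, k=2), style="human")
print(prompt)  # <human> dog beach
```

Under these assumptions, training examples from web sources would carry the `<web>` token and human-annotated ones the `<human>` token, so that at inference time the human style can be requested regardless of where the semantic keywords were learned.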

Abstract (original)

This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both human-annotated and web-collected captions. Large-scale datasets with noisy image-text pairs, indeed, provide a sub-optimal source of supervision because of their low-quality descriptive style, while human-annotated datasets are cleaner but smaller in scale. To get the best of both worlds, we propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component. The proposed model avoids the need of object detectors, is trained with a single objective of prompt language modeling, and can replicate the style of human-collected captions while training on sources with different input styles. Experimentally, the model shows a strong capability of recognizing real-world concepts and producing high-quality captions. Extensive experiments are performed on different image captioning datasets, including CC3M, nocaps, and the competitive COCO dataset, where our model consistently outperforms baselines and state-of-the-art approaches.

arXiv information

Authors	Marcella Cornia, Lorenzo Baraldi, Giuseppe Fiameni, Rita Cucchiara
Published	2023-11-30 11:47:36+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
