Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning

Abstract

Significant progress has been made on visual captioning, largely relying on pre-trained features and later fixed object detectors that serve as rich inputs to auto-regressive models. A key limitation of such methods, however, is that the output of the model is conditioned only on the object detector’s outputs. The assumption that such outputs can represent all necessary information is unrealistic, especially when the detector is transferred across datasets. In this work, we reason about the graphical model induced by this assumption, and propose to add an auxiliary input to represent missing information such as object relationships. We specifically propose to mine attributes and relationships from the Visual Genome dataset and condition the captioning model on them. Crucially, we propose (and show to be important) the use of a multi-modal pre-trained model (CLIP) to retrieve such contextual descriptions. Further, object detector models are frozen and do not have sufficient richness to allow the captioning model to properly ground them. As a result, we propose to condition both the detector and description outputs on the image, and show qualitatively and quantitatively that this can improve grounding. We validate our method on image captioning, perform thorough analyses of each component and importance of the pre-trained multi-modal model, and demonstrate significant improvements over the current state of the art, specifically +7.5% in CIDEr and +1.3% in BLEU-4 metrics.
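
The retrieval step mentioned in the abstract (using a multi-modal model to fetch contextual descriptions for an image) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine` and `retrieve_top_k` are hypothetical, and the short vectors below are toy stand-ins for the image and text embeddings that a model such as CLIP would produce. The sketch only assumes that retrieval ranks candidate descriptions by cosine similarity to the image embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_k(image_emb, text_embs, descriptions, k=2):
    """Return the k descriptions whose embeddings are most similar to the image embedding."""
    scored = sorted(
        zip(descriptions, text_embs),
        key=lambda pair: cosine(image_emb, pair[1]),
        reverse=True,
    )
    return [desc for desc, _ in scored[:k]]


# Toy data (assumption): hand-written 3-d vectors in place of real model embeddings.
descriptions = ["dog on grass", "red car", "man riding horse"]
text_embs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.0, 0.7]]
image_emb = [1.0, 0.0, 0.1]

top = retrieve_top_k(image_emb, text_embs, descriptions, k=2)
print(top)
```

In a real pipeline the candidate descriptions would come from a mined pool (the abstract names attributes and relationships from Visual Genome), and the retrieved text would be passed to the captioning model as an auxiliary input alongside the detector outputs.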

arXiv information

Authors	Chia-Wen Kuo, Zsolt Kira
Published	2022-06-08 02:20:39+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
