Retrieval-Augmented Transformer for Image Captioning

Abstract

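Image captioning models aim to connect vision and language by providing natural-language descriptions of input images.
In the past few years, the task has been tackled by learning parametric models, with work proposing advances in visual feature extraction or modeling better multimodal connections.
This paper investigates an image captioning approach that uses a kNN memory, which allows knowledge to be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarity, a differentiable encoder, and a kNN-augmented attention layer to predict tokens from the past context and from text retrieved from the external memory.
Experimental results on the COCO dataset show that using an explicit external memory can aid the generation process and improve caption quality.
Our work opens up new avenues for improving image captioning models at larger scale.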

Abstract (original)

Image captioning models aim at connecting Vision and Language by providing natural language descriptions of input images. In the past few years, the task has been tackled by learning parametric models and proposing visual feature extraction advancements or by modeling better multi-modal connections. In this paper, we investigate the development of an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process. Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens based on the past context and on text retrieved from the external memory. Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality. Our work opens up new avenues for improving image captioning models at larger scale.
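The retrieval step mentioned in the abstract (a knowledge retriever based on visual similarity over an external memory) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how embeddings are computed or indexed, so the function name `retrieve_captions`, the toy two-dimensional embeddings, the use of cosine similarity, and the example captions are all assumptions made for illustration.

```python
import numpy as np

def retrieve_captions(query_emb, memory_embs, memory_captions, k=3):
    """Return the k memory captions whose image embeddings are closest to the query.

    Hypothetical helper: cosine similarity is an assumed choice of metric,
    not something stated in the abstract.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    m = memory_embs / np.linalg.norm(memory_embs, axis=1, keepdims=True)
    sims = m @ q
    # Indices of the k most similar memory entries, best first.
    top = np.argsort(-sims)[:k]
    return [(memory_captions[i], float(sims[i])) for i in top]

# Toy external memory: one embedding per stored image, paired with its caption.
memory_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])
memory_captions = ["a dog on grass", "a plate of food", "a dog near a table"]

query = np.array([0.9, 0.1])  # embedding of the input image (made up)
results = retrieve_captions(query, memory_embs, memory_captions, k=2)
for caption, score in results:
    print(f"{score:.3f}  {caption}")
```

In a full system the retrieved captions would then be encoded and attended to during decoding, as the abstract describes; that part is left out here because the abstract gives no detail about it.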

arXiv information

Authors	Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Published	2022-08-22 07:52:34+00:00
arXiv page	arxiv_id(pdf)

