With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning

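Summary

Image captioning, like many tasks that combine vision and language, currently relies on Transformer-based architectures to extract the semantics of an image and translate them into linguistically coherent descriptions.
Although successful, the attention operator considers only a weighted sum of projections of the current input sample, so it ignores relevant semantic information that could come from jointly observing other samples.
In this paper, the authors devise a network that, through a prototypical memory model, can perform attention over activations obtained while processing other training samples.
The memory models the distribution of past keys and values by defining prototype vectors that are both discriminative and compact.
The proposed model is evaluated experimentally on the COCO dataset, by comparing it with carefully designed baselines and state-of-the-art approaches and by investigating the role of each proposed component.
The authors demonstrate that the proposal improves the performance of an encoder-decoder Transformer by 3.7 CIDEr points, both when training with cross-entropy only and when fine-tuning with self-critical sequence training.
Source code and trained models are available at https://github.com/aimagelab/PMA-Net.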

Abstract (original)

Image captioning, like many tasks involving vision and language, currently relies on Transformer-based architectures for extracting the semantics in an image and translating it into linguistically coherent descriptions. Although successful, the attention operator only considers a weighted summation of projections of the current input sample, therefore ignoring the relevant semantic information which can come from the joint observation of other samples. In this paper, we devise a network which can perform attention over activations obtained while processing other training samples, through a prototypical memory model. Our memory models the distribution of past keys and values through the definition of prototype vectors which are both discriminative and compact. Experimentally, we assess the performance of the proposed model on the COCO dataset, in comparison with carefully designed baselines and state-of-the-art approaches, and by investigating the role of each of the proposed components. We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points both when training in cross-entropy only and when fine-tuning with self-critical sequence training. Source code and trained models are available at: https://github.com/aimagelab/PMA-Net.
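
As a rough illustration of the idea described in the abstract, attention can be extended so that each query attends both to the keys and values of the current sample and to a set of memory prototype vectors. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name `memory_augmented_attention`, the shapes, and the random placeholder prototypes are all hypothetical, and the abstract does not specify how the prototypes are actually built from past keys and values (see the linked repository for that).

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row-wise maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def memory_augmented_attention(q, k, v, proto_k, proto_v):
    """Scaled dot-product attention over current keys/values plus prototypes.

    q:       (n_q, d) queries from the current sample
    k, v:    (n, d)   keys and values from the current sample
    proto_k: (m, d)   prototype keys (assumed to summarize past keys)
    proto_v: (m, d)   prototype values (assumed to summarize past values)
    """
    # Hypothetical design choice: append the prototypes to the current
    # keys and values so that one softmax covers both sources.
    keys = np.concatenate([k, proto_k], axis=0)      # (n + m, d)
    values = np.concatenate([v, proto_v], axis=0)    # (n + m, d)
    scores = q @ keys.T / np.sqrt(q.shape[-1])       # (n_q, n + m)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights


# Toy usage with random placeholders; a real model would derive the
# prototypes from activations stored while processing other samples.
rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=(3, d))
k = rng.normal(size=(5, d))
v = rng.normal(size=(5, d))
proto_k = rng.normal(size=(4, d))
proto_v = rng.normal(size=(4, d))
out, weights = memory_augmented_attention(q, k, v, proto_k, proto_v)
```

In this sketch each row of `weights` spans the 5 current positions followed by the 4 prototypes, so the share of attention that falls on the memory can be read from the last 4 columns.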

arXiv information

Authors	Manuele Barraco, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Published	2023-08-23 18:53:00+00:00
arXiv page	arxiv_id (pdf)

Source, services used

arxiv.jp, Google
