Self-supervised Cross-view Representation Reconstruction for Change Captioning

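Abstract

Change captioning aims to describe the differences between a pair of similar images.
Its key challenge is learning a stable difference representation under pseudo changes caused by viewpoint shifts.
In this paper, we address this by proposing a self-supervised cross-view representation reconstruction (SCORER) network.
Specifically, we first design multi-head token-wise matching to model the relationships between cross-view features from similar and dissimilar images.
Then, by maximizing the cross-view contrastive alignment of two similar images, SCORER learns two view-invariant image representations in a self-supervised way.
Building on these, we reconstruct the representations of unchanged objects through cross-attention and thereby learn a stable difference representation for caption generation.
In addition, we devise cross-modal backward reasoning to improve caption quality.
This module reversely models a "hallucination" representation from the caption and the "before" representation.
Pulling it closer to the "after" representation forces the caption to be informative about the difference, again in a self-supervised manner.
Extensive experiments show that our method achieves state-of-the-art results on four datasets.
The code is available at https://github.com/tuyunbin/SCORER.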
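
The abstract does not give formulas, so the following is only a rough sketch of how token-wise matching and cross-view contrastive alignment could fit together. It assumes a symmetric max-cosine matching score averaged over heads and an InfoNCE-style loss; the function names, the temperature value, and the toy features are all hypothetical and are not taken from the authors' repository.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def split_heads(tokens, num_heads):
    """Split every token vector into num_heads equal chunks; returns one token list per head."""
    dim = len(tokens[0])
    assert dim % num_heads == 0, "feature dimension must be divisible by num_heads"
    step = dim // num_heads
    return [[tok[h * step:(h + 1) * step] for tok in tokens] for h in range(num_heads)]

def token_wise_similarity(tokens_a, tokens_b, num_heads=2):
    """Assumed matching score: each token takes its best match in the other image,
    the two directions are averaged, and the result is averaged over heads."""
    score = 0.0
    heads_a = split_heads(tokens_a, num_heads)
    heads_b = split_heads(tokens_b, num_heads)
    for head_a, head_b in zip(heads_a, heads_b):
        a_to_b = sum(max(cosine(x, y) for y in head_b) for x in head_a) / len(head_a)
        b_to_a = sum(max(cosine(x, y) for x in head_a) for y in head_b) / len(head_b)
        score += 0.5 * (a_to_b + b_to_a)
    return score / num_heads

def log_softmax_at(values, index):
    """Numerically stable log-softmax of values, evaluated at one index."""
    peak = max(values)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in values))
    return values[index] - log_sum

def contrastive_alignment_loss(before_batch, after_batch, temperature=0.1):
    """InfoNCE-style loss (an assumption): the i-th "before" and "after" images are
    positives, every other pairing in the batch is a negative."""
    n = len(before_batch)
    sim = [[token_wise_similarity(b, a) / temperature for a in after_batch]
           for b in before_batch]
    before_to_after = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    after_to_before = -sum(
        log_softmax_at([sim[j][i] for j in range(n)], i) for i in range(n)) / n
    return 0.5 * (before_to_after + after_to_before)

# Toy features: each "after" image holds the same tokens as its "before" image
# in a different order, a crude stand-in for a viewpoint shift.
before_0 = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
after_0 = [[0.0, 1.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]]
before_1 = [[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 1.0, -1.0]]
after_1 = [[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, 1.0, 1.0]]

if __name__ == "__main__":
    print(contrastive_alignment_loss([before_0, before_1], [after_0, after_1]))
```

Because the score takes a maximum over tokens rather than comparing them position by position, reordering tokens within an image does not change it. That is the property such a matching would need in order to stay stable under viewpoint-induced pseudo changes. Minimizing the loss raises the score of true before/after pairs relative to mismatched ones.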

Abstract (original)

Change captioning aims to describe the difference between a pair of similar images. Its key challenge is how to learn a stable difference representation under pseudo changes caused by viewpoint change. In this paper, we address this by proposing a self-supervised cross-view representation reconstruction (SCORER) network. Concretely, we first design a multi-head token-wise matching to model relationships between cross-view features from similar/dissimilar images. Then, by maximizing cross-view contrastive alignment of two similar images, SCORER learns two view-invariant image representations in a self-supervised way. Based on these, we reconstruct the representations of unchanged objects by cross-attention, thus learning a stable difference representation for caption generation. Further, we devise a cross-modal backward reasoning to improve the quality of caption. This module reversely models a “hallucination” representation with the caption and “before” representation. By pushing it closer to the “after” representation, we enforce the caption to be informative about the difference in a self-supervised manner. Extensive experiments show our method achieves the state-of-the-art results on four datasets. The code is available at https://github.com/tuyunbin/SCORER.

arXiv information

Authors	Yunbin Tu, Liang Li, Li Su, Zheng-Jun Zha, Chenggang Yan, Qingming Huang
Published	2023-09-28 09:28:50+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
