Transforming Visual Scene Graphs to Image Captions

Summary

Title: How to Transform Visual Scene Graphs into Image Captions

Summary:
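– The paper proposes transforming visual scene graphs into more descriptive captions.
– Multi-head attention (MHA) is used to design the graph neural network (GNN) that embeds the scene graphs.
– After embedding, different graph embeddings carry different specific knowledge; motivated by this, a Mixture-of-Experts (MOE)-based decoder is designed, with each expert built on MHA.
– The MOE-based decoder discriminates among the graph embeddings to generate different kinds of words.
– Because both the encoder and the decoder are built on MHA, the architecture is homogeneous, unlike previous heterogeneous pipelines, which usually apply a fully-connected-based GNN and an LSTM-based decoder.
– The homogeneous architecture makes it possible to unify the training configuration, with no need to specify different training strategies for different sub-networks as in heterogeneous pipelines.
– Extensive experiments on the MS-COCO captioning benchmark confirm the effectiveness of the proposed method.
– The code is available at: https://anonymous.4open.science/r/ACL23_TSG.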

Summary (original)

We propose to Transform Scene Graphs (TSG) into more descriptive captions. In TSG, we apply multi-head attention (MHA) to design the Graph Neural Network (GNN) for embedding scene graphs. After embedding, different graph embeddings contain diverse specific knowledge for generating the words with different part-of-speech, e.g., object/attribute embedding is good for generating nouns/adjectives. Motivated by this, we design a Mixture-of-Expert (MOE)-based decoder, where each expert is built on MHA, for discriminating the graph embeddings to generate different kinds of words. Since both the encoder and decoder are built based on the MHA, as a result, we construct a homogeneous encoder-decoder unlike the previous heterogeneous ones which usually apply Fully-Connected-based GNN and LSTM-based decoder. The homogeneous architecture enables us to unify the training configuration of the whole model instead of specifying different training strategies for diverse sub-networks as in the heterogeneous pipeline, which releases the training difficulty. Extensive experiments on the MS-COCO captioning benchmark validate the effectiveness of our TSG. The code is in: https://anonymous.4open.science/r/ACL23_TSG.
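The two ideas in the abstract, attention-based graph embedding and a mixture-of-experts decoder, can be illustrated with a toy sketch. This is not the authors' implementation: it uses a single attention head with no learned projections, restricts each node to attending over its graph neighbors, and routes with a plain softmax gate. All names (`graph_attention_layer`, `moe_decode_step`, `gate_vectors`) and the three-node scene graph are hypothetical, chosen only for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Scaled dot-product attention of one query over keys/values (single head)."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def graph_attention_layer(features, neighbors):
    """Assumed encoder step: each node attends only to itself and its neighbors."""
    out = []
    for i, feat in enumerate(features):
        visible = [i] + sorted(neighbors[i])
        ctx = [features[j] for j in visible]
        out.append(attend(feat, ctx, ctx))
    return out

def moe_decode_step(query, embeddings, node_types, gate_vectors):
    """Assumed decoder step: one attention expert per node type, mixed by a soft gate."""
    expert_names = sorted(gate_vectors)
    expert_outputs = []
    for name in expert_names:
        # Each expert sees only the embeddings of its own node type.
        group = [e for e, t in zip(embeddings, node_types) if t == name]
        expert_outputs.append(attend(query, group, group))
    # Soft router: softmax over query/gate-vector similarities.
    gate = softmax([dot(query, gate_vectors[name]) for name in expert_names])
    dim = len(query)
    mixed = [sum(g * o[i] for g, o in zip(gate, expert_outputs)) for i in range(dim)]
    return mixed, dict(zip(expert_names, gate))

# Toy scene graph: "dog" (object) linked to "brown" (attribute) and "on" (relation).
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
node_types = ["object", "attribute", "relation"]
neighbors = {0: {1, 2}, 1: {0}, 2: {0}}
gate_vectors = {
    "object": [1.0, 0.0, 0.0],
    "attribute": [0.0, 1.0, 0.0],
    "relation": [0.0, 0.0, 1.0],
}

embeddings = graph_attention_layer(features, neighbors)
# A decoder query that resembles the attribute gate vector, as when an adjective is due.
mixed, gate = moe_decode_step([0.0, 2.0, 0.0], embeddings, node_types, gate_vectors)
print(gate)
```

With this query the gate puts its largest weight on the attribute expert, mirroring the abstract's remark that attribute embeddings suit adjectives. A real model would add learned query/key/value projections, multiple heads, and trained gating.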

arXiv information

Authors	Xu Yang, Jiawei Peng, Zihua Wang, Haiyang Xu, Qinghao Ye, Chenliang Li, Ming Yan, Fei Huang, Zhangzikang Li, Yu Zhang
Published	2023-05-04 01:21:16+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, OpenAI
