From Pixels to Objects: Cubic Visual Attention for Visual Question Answering

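Summary

Attention-based visual question answering (VQA) has recently achieved great success by using the question to selectively target the visual regions relevant to the answer.
Existing visual attention models are generally planar: all channels of the image's last conv-layer feature map share the same attention weight.
This is at odds with the attention mechanism, because CNN features are inherently both spatial and channel-wise.
In addition, visual attention is usually computed at the pixel level, which can produce discontinuous attended regions.
This paper proposes a Cubic Visual Attention (CVA) model, which applies a novel channel and spatial attention to object regions to improve VQA.
Specifically, instead of attending to pixels, the model first uses an object proposal network to generate a set of object candidates and extracts their associated conv features.
The question then guides the computation of channel attention and spatial attention over the conv-layer feature map.
Finally, the attended visual features and the question are combined to infer the answer.
CVA is evaluated on three public image QA datasets: COCO-QA, VQA, and Visual7W.
Experimental results show that the proposed method significantly outperforms the state of the art.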
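
The pipeline described above (object-level features, question-guided channel attention, question-guided spatial attention, then fusion) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract gives no equations, so the additive tanh scoring, the mean pooling before channel attention, the elementwise fusion, and every name here (`init_params`, `cubic_attention`, `W_vc`, `W_qc`, `W_c`, `W_vs`, `W_qs`, `w_s`, `W_fv`, `W_fq`, `W_a`) are assumptions chosen for illustration. The object features `V` stand in for the conv features of the proposed object candidates, and the weights are random rather than trained.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(C, D, H, A, seed=0):
    # Hypothetical random weights; a real model would learn these.
    rng = np.random.default_rng(seed)
    shapes = {
        "W_vc": (H, C), "W_qc": (H, D), "W_c": (C, H),   # channel attention
        "W_vs": (H, C), "W_qs": (H, D), "w_s": (H,),     # spatial attention
        "W_fv": (H, C), "W_fq": (H, D), "W_a": (A, H),   # fusion + answer scores
    }
    return {name: 0.1 * rng.standard_normal(shape) for name, shape in shapes.items()}


def cubic_attention(V, q, p):
    """V: (K, C) features of K object regions; q: (D,) question embedding."""
    # Channel attention (assumed form): one weight per channel,
    # conditioned on the question and the mean-pooled object features.
    v_mean = V.mean(axis=0)                              # (C,)
    h_c = np.tanh(p["W_vc"] @ v_mean + p["W_qc"] @ q)    # (H,)
    beta = softmax(p["W_c"] @ h_c)                       # (C,)
    V_c = V * beta                                       # (K, C)

    # Spatial attention (assumed form): one weight per object region.
    h_s = np.tanh(V_c @ p["W_vs"].T + p["W_qs"] @ q)     # (K, H)
    alpha = softmax(h_s @ p["w_s"])                      # (K,)
    v_att = alpha @ V_c                                  # (C,)

    # Fuse attended visual feature with the question and score answers.
    fused = np.tanh(p["W_fv"] @ v_att) * np.tanh(p["W_fq"] @ q)  # (H,)
    scores = p["W_a"] @ fused                            # (A,)
    return scores, beta, alpha


if __name__ == "__main__":
    K, C, D, H, A = 5, 8, 6, 4, 3   # regions, channels, question dim, hidden, answers
    rng = np.random.default_rng(1)
    V, q = rng.standard_normal((K, C)), rng.standard_normal(D)
    scores, beta, alpha = cubic_attention(V, q, init_params(C, D, H, A))
    print("predicted answer index:", int(scores.argmax()))
```

The point of the sketch is the shape of the attention: `beta` weights channels and `alpha` weights object regions, so the two together vary over both dimensions of the object feature matrix instead of sharing one weight across all channels.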

Abstract (original)

Recently, attention-based Visual Question Answering (VQA) has achieved great success by utilizing question to selectively target different visual areas that are related to the answer. Existing visual attention models are generally planar, i.e., different channels of the last conv-layer feature map of an image share the same weight. This conflicts with the attention mechanism because CNN features are naturally spatial and channel-wise. Also, visual attention models are usually conducted on pixel-level, which may cause region discontinuous problems. In this paper, we propose a Cubic Visual Attention (CVA) model by successfully applying a novel channel and spatial attention on object regions to improve VQA task. Specifically, instead of attending to pixels, we first take advantage of the object proposal networks to generate a set of object candidates and extract their associated conv features. Then, we utilize the question to guide channel attention and spatial attention calculation based on the con-layer feature map. Finally, the attended visual features and the question are combined to infer the answer. We assess the performance of our proposed CVA on three public image QA datasets, including COCO-QA, VQA and Visual7W. Experimental results show that our proposed method significantly outperforms the state-of-the-arts.

arXiv information

Authors	Jingkuan Song, Pengpeng Zeng, Lianli Gao, Heng Tao Shen
Published	2022-06-04 07:03:18+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
