Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning

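Summary

Title: Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning

Summary:
– Humans tend to decompose a sentence into different parts, such as "sth do sth at someplace", and then fill each part with specific content.
– Inspired by this, we follow the principle of modular design and propose an image captioner that learns to Collocate Visual-Linguistic Neural Modules (CVLNM).
– Unlike the neural module networks widely used in VQA, where the language (the question) is fully observable, the task for CVLNM is harder: the language is only partially observable, so the modules must be collocated dynamically during image captioning.
– To design and train CVLNM, we make the following technical contributions: 1) a distinguishable module design, with four modules in the encoder (one linguistic module for function words and three visual modules for different content words: nouns, adjectives, and verbs) and a further linguistic module in the decoder for commonsense reasoning; 2) a self-attention based module controller that makes visual reasoning more robust; 3) a part-of-speech based "syntax loss" imposed on the module controller to further regularize the training of CVLNM.
– Extensive experiments on the MS-COCO dataset show that CVLNM is more effective, achieving a new state-of-the-art 129.5 CIDEr-D, and more robust: it is less likely to overfit to dataset bias and degrades less when fewer training samples are available.
– Code is available at https://github.com/GCYZSL/CVLMN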
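
The interaction between the module controller and the syntax loss can be illustrated with a small sketch. It is not the authors' implementation: the summary does not specify how the modules are fused or how the loss is computed, so this assumes a softmax over one controller score per module, a weighted sum of the module outputs, and a cross-entropy loss against the module implied by the word's part-of-speech tag. The names `MODULES`, `collocate`, and `syntax_loss` are hypothetical, and a plain softmax stands in for the self-attention controller described above.

```python
import math

# Hypothetical module order: three visual modules and one linguistic module.
MODULES = ["noun", "adjective", "verb", "function"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def collocate(module_outputs, controller_scores):
    """Soft-fuse module output vectors (assumed fusion rule).

    module_outputs: one feature vector per module, all of equal length.
    controller_scores: one unnormalized score per module.
    Returns the soft weights and the weighted sum of the module outputs.
    """
    weights = softmax(controller_scores)
    dim = len(module_outputs[0])
    fused = [
        sum(w * out[i] for w, out in zip(weights, module_outputs))
        for i in range(dim)
    ]
    return weights, fused


def syntax_loss(weights, pos_tag):
    """Cross-entropy between controller weights and the POS-implied module
    (assumed form of the syntax loss)."""
    target = MODULES.index(pos_tag)
    return -math.log(weights[target])


if __name__ == "__main__":
    outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    weights, fused = collocate(outputs, [2.0, 0.0, 0.0, 0.0])
    print("weights:", weights)
    print("fused:", fused)
    print("loss if the next word is a noun:", syntax_loss(weights, "noun"))
    print("loss if the next word is a verb:", syntax_loss(weights, "verb"))
```

Under these assumptions, the loss is small when the controller puts most of its weight on the module that matches the word's part of speech, which is how such a loss would push the controller toward syntax-consistent module choices.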

Abstract (original)

Humans tend to decompose a sentence into different parts like "sth do sth at someplace" and then fill each part with certain content. Inspired by this, we follow the principle of modular design to propose a novel image captioner: learning to Collocate Visual-Linguistic Neural Modules (CVLNM). Unlike the widely used neural module networks in VQA, where the language (i.e., question) is fully observable, the task of collocating visual-linguistic modules is more challenging. This is because the language is only partially observable, for which we need to dynamically collocate the modules during the process of image captioning. To sum up, we make the following technical contributions to design and train our CVLNM: 1) distinguishable module design: four modules in the encoder, including one linguistic module for function words and three visual modules for different content words (i.e., noun, adjective, and verb), and another linguistic one in the decoder for commonsense reasoning, 2) a self-attention based module controller for robustifying the visual reasoning, 3) a part-of-speech based syntax loss imposed on the module controller for further regularizing the training of our CVLNM. Extensive experiments on the MS-COCO dataset show that our CVLNM is more effective, e.g., achieving a new state-of-the-art 129.5 CIDEr-D, and more robust, e.g., being less likely to overfit to dataset bias and suffering less when fewer training samples are available. Codes are available at https://github.com/GCYZSL/CVLMN

arXiv information

Authors	Xu Yang, Hanwang Zhang, Chongyang Gao, Jianfei Cai
Published	2023-04-24 02:27:07+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, OpenAI
