Towards Language-guided Visual Recognition via Dynamic Convolutions

Summary

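This paper works toward a unified, end-to-end multi-modal network by exploring language-guided visual recognition.
To that end, we first propose a new multi-modal convolution module called Language-dependent Convolution (LaConv).
Its convolution kernels are generated dynamically from natural language information, which helps extract differentiated visual features for different multi-modal examples.
Building on the LaConv module, we then construct LaConvNet, the first fully language-driven convolution network, which unifies visual recognition and multi-modal reasoning in a single forward structure.
To validate LaConv and LaConvNet, we ran extensive experiments on four benchmark datasets covering two vision-and-language tasks: visual question answering (VQA) and referring expression comprehension (REC).
The results show that LaConv outperforms existing multi-modal modules, and they also demonstrate the merits of LaConvNet as a unified network, including a compact architecture, strong generalization ability, and excellent performance (e.g., +4.7% on RefCOCO+).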

Abstract (original)

In this paper, we are committed to establishing a unified and end-to-end multi-modal network via exploring language-guided visual recognition. To approach this target, we first propose a novel multi-modal convolution module called Language-dependent Convolution (LaConv). Its convolution kernels are dynamically generated based on natural language information, which can help extract differentiated visual features for different multi-modal examples. Based on the LaConv module, we further build the first fully language-driven convolution network, termed LaConvNet, which can unify visual recognition and multi-modal reasoning in one forward structure. To validate LaConv and LaConvNet, we conduct extensive experiments on four benchmark datasets of two vision-and-language tasks, i.e., visual question answering (VQA) and referring expression comprehension (REC). The experimental results not only show the performance gains of LaConv compared to existing multi-modal modules, but also demonstrate the merits of LaConvNet as a unified network, including a compact network, high generalization ability and excellent performance, e.g., +4.7% on RefCOCO+.
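
The abstract says only that LaConv's kernels are "dynamically generated based on natural language information"; it gives no formulas. The general idea of a language-conditioned dynamic convolution can be sketched as below. This is a minimal illustration, not the authors' implementation: the linear kernel generator, the softmax normalization, the single-channel feature map, and every name (`generate_kernel`, `conv2d_same`, `language_conditioned_conv`, `weight`) are assumptions made for the example.

```python
import math


def generate_kernel(text_embedding, weight, k):
    """Map a text embedding to a k x k kernel (hypothetical linear generator)."""
    # One logit per kernel position: a dot product of a weight row with the embedding.
    logits = [sum(w * t for w, t in zip(row, text_embedding)) for row in weight]
    # Softmax keeps the kernel positive and summing to 1 (an assumption for this sketch).
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    flat = [e / total for e in exps]
    return [flat[i * k:(i + 1) * k] for i in range(k)]


def conv2d_same(image, kernel):
    """Slide a k x k kernel over a 2D map with zero padding (output keeps the input size)."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy, xx = y + i - pad, x + j - pad
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


def language_conditioned_conv(image, text_embedding, weight, k=3):
    """Filter the same image with a kernel that depends on the text embedding."""
    return conv2d_same(image, generate_kernel(text_embedding, weight, k))


# Toy setup: fixed (untrained) generator weights and two different "sentences".
dim, k = 4, 3
weight = [[((r * 3 + c * 5) % 7 - 3) / 3 for c in range(dim)] for r in range(k * k)]
image = [[float(x + y) for x in range(5)] for y in range(5)]

out_a = language_conditioned_conv(image, [1.0, 0.0, 0.0, 0.0], weight, k)
out_b = language_conditioned_conv(image, [0.0, 1.0, 0.0, 0.0], weight, k)
```

Here the same image produces different feature maps for the two embeddings because each embedding yields its own kernel, which is the behavior the abstract attributes to LaConv at a high level. In a real network the generator weights would be learned and the convolution would run over many channels.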

arXiv information

Authors	Gen Luo,Yiyi Zhou,Xiaoshuai Sun,Yongjian Wu,Yue Gao,Rongrong Ji
Published	2023-09-14 13:37:38+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
