Detection-based Intermediate Supervision for Visual Question Answering

Summary

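Neural module networks (NMNs) have recently shown sustained success in answering compositional visual questions, especially those that require multi-hop visual and logical reasoning.
An NMN decomposes a complex question into several sub-tasks, using instance modules derived from the question's reasoning path, and uses intermediate supervision to guide answer prediction, which makes the reasoning more interpretable.
However, coarse modeling of that intermediate supervision can limit performance.
For example: (1) prior work assumes that each instance module refers to only one grounded object, which overlooks other potentially relevant grounded objects and impedes full cross-modal alignment learning;
(2) IoU-based intermediate supervision can introduce noisy signals, because overlapping bounding boxes may steer the model's focus toward irrelevant objects.
To address these issues, the authors propose Detection-based Intermediate Supervision (DIS), a generative detection framework that enables multiple grounding supervisions through sequence generation.
As a result, DIS provides more comprehensive and accurate intermediate supervision, which improves answer prediction.
In addition, by taking intermediate results into account, DIS makes answers to compositional questions more consistent with answers to their sub-questions. Extensive experiments show that DIS improves accuracy and achieves state-of-the-art reasoning consistency compared with prior approaches.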
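
The bounding-box overlap problem named in point (2) can be illustrated with a small sketch. The code below is not the paper's implementation: the soft-label rule, the 0.5 threshold, and the example boxes are all assumptions chosen for illustration. It shows how a proposal for a different object that happens to overlap the target still receives a large IoU-based label.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_soft_labels(target: Box, proposals: List[Box],
                    threshold: float = 0.5) -> List[float]:
    """Hypothetical supervision rule: keep the IoU as a soft label
    when it reaches the threshold, otherwise assign 0."""
    scores = [iou(target, p) for p in proposals]
    return [s if s >= threshold else 0.0 for s in scores]


# Assumed example: the question refers to the object in `target`.
target: Box = (10, 10, 50, 90)
proposals: List[Box] = [
    (12, 10, 50, 88),      # proposal for the same object
    (14, 12, 50, 80),      # a different object that overlaps the target
    (200, 200, 260, 240),  # an unrelated object far away
]

labels = iou_soft_labels(target, proposals)
for box, label in zip(proposals, labels):
    print(box, round(label, 3))
```

Under these assumed boxes, the second proposal gets a label well above the threshold even though it covers a different object, which is the kind of noisy signal the summary describes.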

Abstract (original)

Recently, neural module networks (NMNs) have yielded ongoing success in answering compositional visual questions, especially those involving multi-hop visual and logical reasoning. NMNs decompose the complex question into several sub-tasks using instance-modules from the reasoning paths of that question and then exploit intermediate supervisions to guide answer prediction, thereby improving inference interpretability. However, their performance may be hindered due to sketchy modeling of intermediate supervisions. For instance, (1) a prior assumption that each instance-module refers to only one grounded object yet overlooks other potentially associated grounded objects, impeding full cross-modal alignment learning; (2) IoU-based intermediate supervisions may introduce noise signals as the bounding box overlap issue might guide the model’s focus towards irrelevant objects. To address these issues, a novel method, Detection-based Intermediate Supervision (DIS), is proposed, which adopts a generative detection framework to facilitate multiple grounding supervisions via sequence generation. As such, DIS offers more comprehensive and accurate intermediate supervisions, thereby boosting answer prediction performance. Furthermore, by considering intermediate results, DIS enhances the consistency in answering compositional questions and their sub-questions. Extensive experiments demonstrate the superiority of our proposed DIS, showcasing both improved accuracy and state-of-the-art reasoning consistency compared to prior approaches.
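
One way to picture "multiple grounding supervisions via sequence generation" is to serialize every grounded box for a reasoning step into a single token sequence that a decoder is trained to emit. The abstract does not specify the token format, so everything in the sketch below is an assumption: the coordinate quantization into `NUM_BINS` bins, the `EOS` token id, the sort order, and the helper names are hypothetical and only illustrate the general idea.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

NUM_BINS = 1000  # assumed number of coordinate bins
EOS = NUM_BINS   # hypothetical end-of-sequence token id


def quantize(value: float, size: float, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate to a discrete bin index in [0, num_bins - 1]."""
    frac = min(max(value / size, 0.0), 1.0)
    return min(int(frac * num_bins), num_bins - 1)


def boxes_to_sequence(boxes: Sequence[Box], width: float,
                      height: float) -> List[int]:
    """Serialize any number of boxes as 4 tokens each, then EOS."""
    tokens: List[int] = []
    for x1, y1, x2, y2 in sorted(boxes):  # fixed order (an assumption)
        tokens += [quantize(x1, width), quantize(y1, height),
                   quantize(x2, width), quantize(y2, height)]
    tokens.append(EOS)
    return tokens


def sequence_to_boxes(tokens: Sequence[int], width: float,
                      height: float) -> List[Box]:
    """Decode tokens back to boxes using bin centers; stops at EOS."""
    coords = list(tokens[:list(tokens).index(EOS)])
    boxes: List[Box] = []
    for i in range(0, len(coords) - 3, 4):
        x1, y1, x2, y2 = coords[i:i + 4]
        boxes.append(((x1 + 0.5) / NUM_BINS * width,
                      (y1 + 0.5) / NUM_BINS * height,
                      (x2 + 0.5) / NUM_BINS * width,
                      (y2 + 0.5) / NUM_BINS * height))
    return boxes


# Assumed example: one reasoning step grounded to two objects.
width, height = 640.0, 480.0
grounded: List[Box] = [(320, 240, 640, 480), (0, 0, 160, 120)]
tokens = boxes_to_sequence(grounded, width, height)
decoded = sequence_to_boxes(tokens, width, height)
print(tokens)
print(decoded)
```

Because the target is a variable-length sequence, a step that refers to several objects simply yields more tokens before `EOS`, rather than forcing a single-object label.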

arXiv information

Authors	Yuhang Liu, Daowan Peng, Wei Wei, Yuanyuan Fu, Wenfeng Xie, Dangyang Chen
Published	2023-12-26 11:45:22+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
