Focusing On Targets For Improving Weakly Supervised Visual Grounding

Summary

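Weakly supervised visual grounding aims to predict the region of an image that corresponds to a given language query. In this setting, the mapping between the target object and the query is unknown during training.
The state-of-the-art method uses a vision-language pre-training model to obtain heatmaps from Grad-CAM, matching every query word with an image region, and uses the combined heatmap to rank the region proposals.
This paper proposes two simple but efficient methods for improving this approach.
First, it proposes a target-aware cropping approach that encourages the model to learn both object-level and scene-level semantic representations.
Second, it applies dependency parsing to extract the words related to the target object and places emphasis on these words when combining the heatmaps.
The method surpasses the previous SOTA methods on RefCOCO, RefCOCO+, and RefCOCOg by a notable margin.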

Summary (original)

Weakly supervised visual grounding aims to predict the region in an image that corresponds to a specific linguistic query, where the mapping between the target object and query is unknown in the training stage. The state-of-the-art method uses a vision language pre-training model to acquire heatmaps from Grad-CAM, which matches every query word with an image region, and uses the combined heatmap to rank the region proposals. In this paper, we propose two simple but efficient methods for improving this approach. First, we propose a target-aware cropping approach to encourage the model to learn both object and scene level semantic representations. Second, we apply dependency parsing to extract words related to the target object, and then put emphasis on these words in the heatmap combination. Our method surpasses the previous SOTA methods on RefCOCO, RefCOCO+, and RefCOCOg by a notable margin.
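
The heatmap combination and proposal ranking described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `combine_heatmaps` and `rank_proposals`, the `target_weight` parameter, the weighted-average rule, and the mean-value box score are all assumptions made for illustration. The per-word heatmaps (which the paper obtains from Grad-CAM) and the list of target-related words (which the paper obtains from dependency parsing) are taken as given inputs here.

```python
from typing import Dict, List, Sequence, Tuple

Heatmap = List[List[float]]      # H x W grid of per-pixel relevance scores
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); x1 and y1 are exclusive


def combine_heatmaps(word_maps: Dict[str, Heatmap],
                     target_words: Sequence[str],
                     target_weight: float = 2.0) -> Heatmap:
    """Weighted average of per-word heatmaps (hypothetical rule).

    Words listed in target_words get weight target_weight; all other
    words get weight 1.0. The weighting scheme is an assumption, not
    the formula used in the paper.
    """
    if not word_maps:
        raise ValueError("word_maps must not be empty")
    targets = set(target_words)
    first = next(iter(word_maps.values()))
    height, width = len(first), len(first[0])
    combined = [[0.0] * width for _ in range(height)]
    total_weight = 0.0
    for word, heatmap in word_maps.items():
        weight = target_weight if word in targets else 1.0
        total_weight += weight
        for y in range(height):
            for x in range(width):
                combined[y][x] += weight * heatmap[y][x]
    return [[value / total_weight for value in row] for row in combined]


def rank_proposals(heatmap: Heatmap, boxes: Sequence[Box]) -> List[Box]:
    """Sort region proposals by mean heatmap value inside each box."""
    def score(box: Box) -> float:
        x0, y0, x1, y1 = box
        values = [heatmap[y][x] for y in range(y0, y1) for x in range(x0, x1)]
        return sum(values) / len(values) if values else 0.0
    return sorted(boxes, key=score, reverse=True)
```

With a query such as "dog on the left", passing `target_words=["dog"]` makes the heatmap for "dog" dominate the combined map, so proposals covering the dog rank above proposals that only match the positional word.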

arXiv information

Authors	Viet-Quoc Pham, Nao Mishima
Published	2023-02-22 10:02:21+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
