Improved Visual Grounding through Self-Consistent Explanations

Summary

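Vision-and-language models trained to match images with text can be combined with visual explanation methods to point to the locations of specific objects in an image.
Our work shows that the localization ("grounding") abilities of these models can be further improved by finetuning for self-consistent visual explanations.
We propose a strategy for augmenting existing text-image datasets with paraphrases generated by a large language model, and SelfEQ, a weakly-supervised strategy that operates on the visual explanation maps of paraphrases and encourages self-consistency.
Specifically, for an input textual phrase, we attempt to generate a paraphrase and finetune the model so that the phrase and its paraphrase map to the same region in the image.
We posit that this both expands the vocabulary the model is able to handle and improves the quality of the object locations highlighted by gradient-based visual explanation methods (e.g. GradCAM).
We demonstrate that SelfEQ improves performance on Flickr30k, ReferIt, and RefCOCO+ over a strong baseline method and several prior works.
In particular, compared to other methods that do not use any type of box annotations, we obtain 84.07% on Flickr30k (an absolute improvement of 4.69%), 67.40% on ReferIt (an absolute improvement of 7.68%), and 75.10% and 55.49% on RefCOCO+ test sets A and B respectively (an absolute improvement of 3.74% on average).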
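
The idea that a phrase and its paraphrase should map to the same image region can be illustrated with a small sketch. The abstract does not state the actual SelfEQ objective, so everything here is an assumption made for illustration: explanation maps are represented as plain 2D lists, the helper names `normalize` and `consistency_loss` are hypothetical, and the penalty is a simple mean squared difference between min-max normalized maps.

```python
def normalize(heatmap):
    """Min-max scale a 2D heatmap (list of rows) to the range [0, 1]."""
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant map carries no location information; return all zeros.
        return [[0.0 for _ in row] for row in heatmap]
    return [[(v - lo) / (hi - lo) for v in row] for row in heatmap]


def consistency_loss(map_phrase, map_paraphrase):
    """Mean squared difference between two normalized heatmaps.

    Illustrative stand-in only, not the loss used by SelfEQ.
    """
    a = normalize(map_phrase)
    b = normalize(map_paraphrase)
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Toy 2x2 explanation maps (made-up values, not real GradCAM output).
map_a = [[0.0, 1.0], [0.0, 0.0]]  # phrase highlights the top-right cell
map_b = [[0.0, 2.0], [0.0, 0.0]]  # paraphrase: same cell, different scale
map_c = [[3.0, 0.0], [0.0, 0.0]]  # paraphrase: a different cell

loss_same = consistency_loss(map_a, map_b)  # maps agree after normalization
loss_diff = consistency_loss(map_a, map_c)  # maps disagree
```

In this toy setup the penalty is zero when both maps highlight the same cell and grows when they highlight different cells, which is the kind of signal a finetuning step could minimize so that a phrase and its paraphrase are grounded consistently.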

Abstract (original)

Vision-and-language models trained to match images with text can be combined with visual explanation methods to point to the locations of specific objects in an image. Our work shows that the localization –‘grounding’– abilities of these models can be further improved by finetuning for self-consistent visual explanations. We propose a strategy for augmenting existing text-image datasets with paraphrases using a large language model, and SelfEQ, a weakly-supervised strategy on visual explanation maps for paraphrases that encourages self-consistency. Specifically, for an input textual phrase, we attempt to generate a paraphrase and finetune the model so that the phrase and paraphrase map to the same region in the image. We posit that this both expands the vocabulary that the model is able to handle, and improves the quality of the object locations highlighted by gradient-based visual explanation methods (e.g. GradCAM). We demonstrate that SelfEQ improves performance on Flickr30k, ReferIt, and RefCOCO+ over a strong baseline method and several prior works. Particularly, comparing to other methods that do not use any type of box annotations, we obtain 84.07% on Flickr30k (an absolute improvement of 4.69%), 67.40% on ReferIt (an absolute improvement of 7.68%), and 75.10%, 55.49% on RefCOCO+ test sets A and B respectively (an absolute improvement of 3.74% on average).

arXiv information

Authors	Ruozhen He, Paola Cascante-Bonilla, Ziyan Yang, Alexander C. Berg, Vicente Ordonez
Published	2023-12-07 18:59:22+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
