Mind the GAP: Glimpse-based Active Perception improves generalization and sample efficiency of visual reasoning

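Summary

Humans are far better than AI systems at understanding visual relations, especially for previously unseen objects.
For example, AI systems struggle to decide whether two such objects are visually the same or different, while humans do so with ease.
Active vision theories postulate that learning visual relations is grounded in the actions we take to fixate objects and their parts by moving our eyes.
In particular, the low-dimensional spatial information about the corresponding eye movements is hypothesized to facilitate the representation of relations between different image parts.
Inspired by these theories, the authors develop a system equipped with a novel Glimpse-based Active Perception (GAP) that sequentially glimpses at the most salient regions of the input image and processes them at high resolution.
Importantly, the system uses the locations produced by the glimpsing actions, together with the visual content around them, to represent relations between different parts of the image.
The results suggest that GAP is essential for extracting visual relations that go beyond the immediate visual content.
The approach reaches state-of-the-art performance on several visual reasoning tasks while being more sample-efficient and generalizing better to out-of-distribution visual inputs than prior models.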

Summary (original)

Human capabilities in understanding visual relations are far superior to those of AI systems, especially for previously unseen objects. For example, while AI systems struggle to determine whether two such objects are visually the same or different, humans can do so with ease. Active vision theories postulate that the learning of visual relations is grounded in actions that we take to fixate objects and their parts by moving our eyes. In particular, the low-dimensional spatial information about the corresponding eye movements is hypothesized to facilitate the representation of relations between different image parts. Inspired by these theories, we develop a system equipped with a novel Glimpse-based Active Perception (GAP) that sequentially glimpses at the most salient regions of the input image and processes them at high resolution. Importantly, our system leverages the locations stemming from the glimpsing actions, along with the visual content around them, to represent relations between different parts of the image. The results suggest that the GAP is essential for extracting visual relations that go beyond the immediate visual content. Our approach reaches state-of-the-art performance on several visual reasoning tasks being more sample-efficient, and generalizing better to out-of-distribution visual inputs than prior models.
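The glimpsing loop described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, since the abstract does not specify the details: the saliency measure (plain gradient magnitude), the inhibition-of-return step that prevents revisiting a region, and the names `saliency_map` and `extract_glimpses` are all hypothetical and are not taken from the paper. The sketch only shows the general idea of returning both the "what" (high-resolution patches) and the "where" (glimpse locations).

```python
import numpy as np


def saliency_map(image: np.ndarray) -> np.ndarray:
    """Stand-in saliency: gradient magnitude (an assumption, not the paper's measure)."""
    gy, gx = np.gradient(image.astype(float))
    return np.hypot(gx, gy)


def extract_glimpses(image: np.ndarray, num_glimpses: int = 4, patch_size: int = 5):
    """Return (patches, locations) for the most salient regions of a 2-D image.

    patch_size is assumed to be odd so that each patch is centred on its glimpse.
    """
    half = patch_size // 2
    # Zero-pad so that patches centred near the border keep a fixed size.
    padded = np.pad(image.astype(float), half, mode="constant")
    saliency = saliency_map(image)
    height, width = image.shape
    patches, locations = [], []
    for _ in range(num_glimpses):
        # Glimpse at the currently most salient pixel.
        row, col = np.unravel_index(np.argmax(saliency), saliency.shape)
        # Crop at full resolution; (row, col) in the image is the patch centre
        # once the padding offset is taken into account.
        patches.append(padded[row:row + patch_size, col:col + patch_size])
        # Low-dimensional "where" signal: glimpse centre normalised to [0, 1].
        locations.append((row / (height - 1), col / (width - 1)))
        # Inhibition of return (assumed): suppress the visited neighbourhood.
        r0, r1 = max(row - half, 0), min(row + half + 1, height)
        c0, c1 = max(col - half, 0), min(col + half + 1, width)
        saliency[r0:r1, c0:c1] = -np.inf
    return np.stack(patches), np.array(locations)
```

A downstream model would then consume the patches and the locations together, for example by concatenating a patch embedding with its location before reasoning over the sequence of glimpses. How the paper actually combines them is not stated in the abstract.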

arXiv information

Authors	Oleh Kolner, Thomas Ortner, Stanisław Woźniak, Angeliki Pantazi
Published	2025-04-01 14:43:55+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
