Inherently Faithful Attention Maps for Vision Transformers

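Summary

We introduce an attention-based method that uses learned binary attention masks to guarantee that only the attended image regions influence the prediction.
Context strongly influences object perception and can lead to biased representations, especially when objects appear against out-of-distribution backgrounds.
At the same time, many image-level, object-centric tasks require identifying the relevant regions, which often requires context.
To resolve this tension, we propose a two-stage framework. Stage 1 processes the full image to discover object parts and identify task-relevant regions. Stage 2 uses input attention masking to restrict its receptive field to those regions, enabling focused analysis while filtering out potentially spurious information.
Both stages are trained jointly, so stage 2 can refine stage 1. Extensive experiments across diverse benchmarks show that the approach significantly improves robustness to spurious correlations and out-of-distribution backgrounds.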

Abstract (original)

We introduce an attention-based method that uses learned binary attention masks to ensure that only attended image regions influence the prediction. Context can strongly affect object perception, sometimes leading to biased representations, particularly when objects appear in out-of-distribution backgrounds. At the same time, many image-level object-centric tasks require identifying relevant regions, often requiring context. To address this conundrum, we propose a two-stage framework: stage 1 processes the full image to discover object parts and identify task-relevant regions, while stage 2 leverages input attention masking to restrict its receptive field to these regions, enabling a focused analysis while filtering out potentially spurious information. Both stages are trained jointly, allowing stage 2 to refine stage 1. Extensive experiments across diverse benchmarks demonstrate that our approach significantly improves robustness against spurious correlations and out-of-distribution backgrounds.
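
The masking idea in stage 2 can be illustrated with a small sketch. This is not the authors' implementation: it is a single-head, pure-Python scaled dot-product attention in which a binary mask (here a hand-written list, whereas the paper learns it in stage 1) removes tokens from the set of keys. The function names `masked_attention` and `foreground_outputs` and the toy token values are assumptions made for illustration only.

```python
import math


def masked_attention(queries, keys, values, keep):
    """Single-head scaled dot-product attention with a binary key mask.

    keep[j] is 1 if token j may be attended to and 0 otherwise. Masked
    tokens receive zero attention weight, so their values cannot reach
    the output.
    """
    if not any(keep):
        raise ValueError("at least one token must be kept")
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Score kept keys normally; masked keys get -inf so softmax gives 0.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) if m else -math.inf
            for k, m in zip(keys, keep)
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors, one output component at a time.
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


def foreground_outputs(tokens, keep):
    """Self-attention over tokens, returning only the rows of kept tokens."""
    out = masked_attention(tokens, tokens, tokens, keep)
    return [row for row, m in zip(out, keep) if m]


if __name__ == "__main__":
    keep = [1, 1, 0, 0]  # hypothetical mask: first two patches are "object"
    image_a = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [-3.0, 2.0]]
    image_b = [[1.0, 0.0], [0.0, 1.0], [-9.0, 4.0], [7.0, 7.0]]  # new background
    print(foreground_outputs(image_a, keep))
    print(foreground_outputs(image_b, keep))
```

In this toy setting the two printed results coincide, because the background tokens differ only at masked positions. That is the sense in which a binary mask restricts what can influence the output; how the mask is learned and how the two stages are trained jointly is described in the paper itself, not here.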

arXiv information

Authors	Ananthu Aniraj, Cassio F. Dantas, Dino Ienco, Diego Marcos
Published	2025-06-10 15:41:22+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
