Revisiting Continuity of Image Tokens for Cross-domain Few-shot Learning

Summary

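Vision Transformer (ViT) has achieved remarkable success through large-scale pretraining on general domains, but it still faces challenges when applied to distant downstream domains where training data is scarce. Specifically, disrupting the continuity of image tokens in ViT (i.e., making pixels not transition smoothly across patches) causes a noticeable performance drop in the general (source) domain but only a marginal drop in downstream target domains. This calls into question the role of image-token continuity in ViT's generalization under large domain gaps. In this paper, we delve into this phenomenon and interpret it. We find that continuity helps ViT learn larger spatial patterns. Meanwhile, this suggests that under extreme domain gaps only the smaller patterns within each patch can be transferred. Based on this interpretation, we further propose a simple yet effective method for Cross-Domain Few-Shot Learning (CDFSL) that better disrupts the continuity of image tokens, encouraging the model to rely less on large patterns and more on smaller ones. Extensive experiments show the effectiveness of our method in reducing domain gaps and outperforming the state of the art. Code and models are available at https://github.com/shuaiyi308/ReCIT.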

Summary (original)

Vision Transformer (ViT) has achieved remarkable success due to its large-scale pretraining on general domains, but it still faces challenges when applying it to downstream distant domains that have only scarce training data, which gives rise to the Cross-Domain Few-Shot Learning (CDFSL) task. Inspired by Self-Attention’s insensitivity to token orders, we find an interesting phenomenon neglected in current works: disrupting the continuity of image tokens (i.e., making pixels not smoothly transited across patches) in ViT leads to a noticeable performance decline in the general (source) domain but only a marginal decrease in downstream target domains. This questions the role of image tokens’ continuity in ViT’s generalization under large domain gaps. In this paper, we delve into this phenomenon for an interpretation. We find continuity aids ViT in learning larger spatial patterns, which are harder to transfer than smaller ones, enlarging domain distances. Meanwhile, it implies that only smaller patterns within each patch could be transferred under extreme domain gaps. Based on this interpretation, we further propose a simple yet effective method for CDFSL that better disrupts the continuity of image tokens, encouraging the model to rely less on large patterns and more on smaller ones. Extensive experiments show the effectiveness of our method in reducing domain gaps and outperforming state-of-the-art works. Codes and models are available at https://github.com/shuaiyi308/ReCIT.
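
To make "disrupting the continuity of image tokens" concrete, the following is a minimal sketch that shuffles non-overlapping patches of a toy image, so pixels inside each patch are untouched while pixels no longer transition smoothly across patch borders. It is an illustrative stand-in only, not the method proposed in the paper; the function names (`split_into_patches`, `merge_patches`, `shuffle_patches`, `boundary_jump`) and the nested-list image representation are assumptions made for this example.

```python
import random


def split_into_patches(image, patch):
    """Split an H x W image (list of rows) into a row-major list of patch x patch blocks."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([row[left:left + patch] for row in image[top:top + patch]])
    return patches


def merge_patches(patches, h, w, patch):
    """Inverse of split_into_patches: tile row-major blocks back into an H x W image."""
    cols = w // patch
    image = [[None] * w for _ in range(h)]
    for idx, block in enumerate(patches):
        top, left = (idx // cols) * patch, (idx % cols) * patch
        for dy, block_row in enumerate(block):
            image[top + dy][left:left + patch] = block_row
    return image


def shuffle_patches(image, patch, seed=None):
    """Randomly permute patch positions; content inside each patch is unchanged."""
    rng = random.Random(seed)
    patches = split_into_patches(image, patch)
    rng.shuffle(patches)
    return merge_patches(patches, len(image), len(image[0]), patch)


def boundary_jump(image, patch):
    """Mean absolute difference between horizontal neighbours that straddle a patch border."""
    diffs = [abs(row[x] - row[x - 1])
             for row in image
             for x in range(patch, len(row), patch)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # A smooth 8 x 8 gradient: neighbouring pixels always differ by exactly 1.
    smooth = [[x + y for x in range(8)] for y in range(8)]
    shuffled = shuffle_patches(smooth, patch=4, seed=0)
    print("border jump, original:", boundary_jump(smooth, 4))
    print("border jump, shuffled:", boundary_jump(shuffled, 4))
```

Comparing `boundary_jump` before and after shuffling gives a rough sense of how much cross-patch continuity is lost, while the set of patches (the "smaller patterns" in the abstract's terms) stays the same.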

arXiv information

Authors	Shuai Yi, Yixiong Zou, Yuhua Li, Ruixuan Li
Published	2025-06-03 17:40:36+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
