Vision Transformers with Self-Distilled Registers

Abstract

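Vision Transformers (ViTs) have emerged as the dominant architecture for visual processing tasks, and they scale well as training data and model size grow.
However, recent work has identified artifact tokens in ViTs that are inconsistent with the local semantics.
These anomalous tokens degrade ViT performance on tasks that require fine-grained localization or structural coherence.
An effective mitigation is to add register tokens to the ViT, which implicitly "absorb" the artifact term during training.
Given that many large-scale pre-trained ViTs are already available, this paper aims to equip them with such register tokens without retraining them from scratch.
Specifically, the authors propose Post Hoc Registers (PH-Reg), an efficient self-distillation method that integrates registers into an existing ViT without requiring additional labeled data or full retraining.
PH-Reg initializes both the teacher and the student network from the same pre-trained ViT.
The teacher remains frozen and unmodified, while the student is augmented with randomly initialized register tokens.
Applying test-time augmentation to the teacher's inputs yields denoised dense embeddings that are free of artifacts, and these are used to optimize only a small subset of unlocked student weights.
The authors show that the approach effectively reduces the number of artifact tokens and improves the segmentation and depth prediction of the student ViT under zero-shot and linear probing.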
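
The denoising step described above can be illustrated with a toy sketch. It is not the authors' implementation: `toy_teacher` is a hypothetical stand-in for a frozen ViT, the augmentations are limited to flips, and the per-patch median used to combine the views is an assumption made here for illustration (the abstract does not say how the augmented outputs are aggregated). The point is only that an artifact tied to a fixed token position lands on a different patch in each re-aligned view, so a robust aggregate suppresses it.

```python
from statistics import median

def toy_teacher(grid):
    """Hypothetical frozen teacher: one scalar 'embedding' per patch,
    plus a large artifact at a fixed token position."""
    out = [[2.0 * v for v in row] for row in grid]
    out[0][0] += 100.0  # artifact depends on token position, not on content
    return out

def hflip(grid):
    return [row[::-1] for row in grid]

def vflip(grid):
    return grid[::-1]

def identity(grid):
    return grid

# Each augmentation is paired with the inverse that re-aligns the output.
AUGMENTATIONS = [
    (identity, identity),
    (hflip, hflip),
    (vflip, vflip),
    (lambda g: vflip(hflip(g)), lambda g: hflip(vflip(g))),
]

def denoised_targets(grid, teacher):
    """Run the teacher on every augmented view, undo the augmentation on
    the output, and take the per-patch median across views."""
    views = [inverse(teacher(augment(grid))) for augment, inverse in AUGMENTATIONS]
    height, width = len(grid), len(grid[0])
    return [
        [median(view[i][j] for view in views) for j in range(width)]
        for i in range(height)
    ]

grid = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
]
raw = toy_teacher(grid)                      # contains the artifact
targets = denoised_targets(grid, toy_teacher)
print(raw[0][0], targets[0][0])              # 102.0 2.0
```

In a setup like the one the abstract describes, targets of this kind would serve as the regression targets for the patch outputs of the register-augmented student, while the teacher itself stays untouched.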

Abstract (original)

Vision Transformers (ViTs) have emerged as the dominant architecture for visual processing tasks, demonstrating excellent scalability with increased training data and model size. However, recent work has identified the emergence of artifact tokens in ViTs that are incongruous with the local semantics. These anomalous tokens degrade ViT performance in tasks that require fine-grained localization or structural coherence. An effective mitigation of this issue is the addition of register tokens to ViTs, which implicitly ‘absorb’ the artifact term during training. Given the availability of various large-scale pre-trained ViTs, in this paper we aim at equipping them with such register tokens without the need of re-training them from scratch, which is infeasible considering their size. Specifically, we propose Post Hoc Registers (PH-Reg), an efficient self-distillation method that integrates registers into an existing ViT without requiring additional labeled data and full retraining. PH-Reg initializes both teacher and student networks from the same pre-trained ViT. The teacher remains frozen and unmodified, while the student is augmented with randomly initialized register tokens. By applying test-time augmentation to the teacher’s inputs, we generate denoised dense embeddings free of artifacts, which are then used to optimize only a small subset of unlocked student weights. We show that our approach can effectively reduce the number of artifact tokens, improving the segmentation and depth prediction of the student ViT under zero-shot and linear probing.

arXiv information

Authors	Yinjie Chen, Zipeng Yan, Chong Zhou, Bo Dai, Andrew F. Luo
Published	2025-05-27 17:59:41+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
