Vision Transformers Don’t Need Trained Registers

Summary

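We investigate the mechanism underlying a previously identified phenomenon in Vision Transformers: the emergence of high-norm tokens that lead to noisy attention maps.
In multiple models (e.g., CLIP, DINOv2), we observe that a sparse set of neurons concentrates high-norm activations on outlier tokens, causing irregular attention patterns and degrading downstream visual processing.
The existing solution for removing these outliers is to retrain the model from scratch with additional learned register tokens; we instead use our findings to create a training-free approach that mitigates these artifacts.
By shifting the high-norm activations from the discovered register neurons into an additional untrained token, we can mimic the effect of register tokens on a model that was already trained without registers.
We demonstrate that our method produces cleaner attention and feature maps, improves performance over the base models across multiple downstream visual tasks, and achieves results comparable to models explicitly trained with register tokens.
We then extend test-time registers to off-the-shelf vision-language models to improve their interpretability.
Our results suggest that test-time registers effectively take on the role of register tokens at test time, offering a training-free solution for any pre-trained model released without them.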

Summary (original)

We investigate the mechanism underlying a previously identified phenomenon in Vision Transformers — the emergence of high-norm tokens that lead to noisy attention maps. We observe that in multiple models (e.g., CLIP, DINOv2), a sparse set of neurons is responsible for concentrating high-norm activations on outlier tokens, leading to irregular attention patterns and degrading downstream visual processing. While the existing solution for removing these outliers involves retraining models from scratch with additional learned register tokens, we use our findings to create a training-free approach to mitigate these artifacts. By shifting the high-norm activations from our discovered register neurons into an additional untrained token, we can mimic the effect of register tokens on a model already trained without registers. We demonstrate that our method produces cleaner attention and feature maps, enhances performance over base models across multiple downstream visual tasks, and achieves results comparable to models explicitly trained with register tokens. We then extend test-time registers to off-the-shelf vision-language models to improve their interpretability. Our results suggest that test-time registers effectively take on the role of register tokens at test-time, offering a training-free solution for any pre-trained model released without them.
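
The activation shift described in the abstract can be illustrated with a toy example. This is a minimal sketch based only on the abstract, not the authors' implementation: the function names `shift_to_test_time_register` and `token_norms`, the choice of moving the largest-magnitude value, and zeroing the register neurons on the original tokens are all assumptions made for illustration. Activations are represented as a plain tokens-by-neurons list of lists.

```python
from typing import List, Sequence


def shift_to_test_time_register(
    activations: List[List[float]],
    register_neurons: Sequence[int],
) -> List[List[float]]:
    """Append one extra token and move register-neuron activations onto it.

    `activations` is a tokens x neurons table. For each index in
    `register_neurons` (assumed to be known in advance), the
    largest-magnitude value across the original tokens is copied to the
    appended token, and that neuron is zeroed on the original tokens.
    The input is not modified.
    """
    if not activations:
        raise ValueError("need at least one token")
    num_neurons = len(activations[0])
    shifted = [list(row) for row in activations]
    extra_token = [0.0] * num_neurons  # the untrained, hypothetical register token
    for n in register_neurons:
        column = [row[n] for row in activations]
        extra_token[n] = max(column, key=abs)
        for row in shifted:
            row[n] = 0.0
    shifted.append(extra_token)
    return shifted


def token_norms(activations: List[List[float]]) -> List[float]:
    """Euclidean norm of each token's activation vector."""
    return [sum(v * v for v in row) ** 0.5 for row in activations]


# Toy input: token 1 is an outlier because neuron 1 fires strongly on it.
acts = [[0.1, 0.2, 0.1], [0.2, 50.0, 0.1], [0.1, 0.3, 0.2]]
shifted = shift_to_test_time_register(acts, register_neurons=[1])
print(token_norms(acts))
print(token_norms(shifted))
```

In this toy input the high norm moves from the second image token to the appended token, which is the qualitative effect the abstract attributes to test-time registers. How register neurons are identified, and at which layers the shift is applied, is not specified in the abstract and is left out here.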

arXiv information

Authors	Nick Jiang, Amil Dravid, Alexei Efros, Yossi Gandelsman
Published	2025-06-18 16:30:46+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
