Simple Semi-supervised Knowledge Distillation from Vision-Language Models via $\mathbf{\texttt{D}}$ual-$\mathbf{\texttt{H}}$ead $\mathbf{\texttt{O}}$ptimization

Abstract

Vision-language models (VLMs) have achieved remarkable success across diverse tasks by leveraging rich textual information with minimal labeled data. However, deploying such large models remains challenging, particularly in resource-constrained environments. Knowledge distillation (KD) offers a well-established solution to this problem; however, recent KD approaches from VLMs often involve multi-stage training or additional tuning, increasing computational overhead and optimization complexity. In this paper, we propose $\mathbf{\texttt{D}}$ual-$\mathbf{\texttt{H}}$ead $\mathbf{\texttt{O}}$ptimization ($\mathbf{\texttt{DHO}}$) — a simple yet effective KD framework that transfers knowledge from VLMs to compact, task-specific models in semi-supervised settings. Specifically, we introduce dual prediction heads that independently learn from labeled data and teacher predictions, and propose to linearly combine their outputs during inference. We observe that $\texttt{DHO}$ mitigates gradient conflicts between supervised and distillation signals, enabling more effective feature learning than single-head KD baselines. As a result, extensive experiments show that $\texttt{DHO}$ consistently outperforms baselines across multiple domains and fine-grained datasets. Notably, on ImageNet, it achieves state-of-the-art performance, improving accuracy by 3% and 0.1% with 1% and 10% labeled data, respectively, while using fewer parameters.
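
The dual-head idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a shared feature vector feeding two linear heads, a cross-entropy loss for the supervised head, a KL-divergence loss against teacher probabilities for the distillation head, and an inference-time combination of the two heads' softmax outputs weighted by a hypothetical parameter `alpha`. The abstract says only that the outputs are combined linearly, so the loss choices, the probability-space combination, and all function names here are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_head(features, weights, bias):
    """One prediction head: logits = W @ features + b."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def dual_head_losses(features, sup_head, kd_head, label, teacher_probs):
    """Return (supervised loss, distillation loss) for one example.

    Each loss depends on its own head only, so the two training signals
    meet only in the shared features (assumed setup, see lead-in).
    """
    sup_probs = softmax(linear_head(features, *sup_head))
    kd_probs = softmax(linear_head(features, *kd_head))
    # Cross-entropy against the ground-truth label.
    sup_loss = -math.log(sup_probs[label])
    # KL(teacher || student) on the distillation head.
    kd_loss = sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, kd_probs)
        if t > 0
    )
    return sup_loss, kd_loss


def combine_heads(features, sup_head, kd_head, alpha=0.5):
    """Inference: linear combination of the two heads' probabilities.

    `alpha` is a hypothetical mixing weight, not a value from the paper.
    """
    sup_probs = softmax(linear_head(features, *sup_head))
    kd_probs = softmax(linear_head(features, *kd_head))
    return [alpha * p + (1 - alpha) * q for p, q in zip(sup_probs, kd_probs)]
```

Each head is passed as a `(weights, bias)` pair. With `alpha=1.0` the prediction reduces to the supervised head alone, and with `alpha=0.0` to the distillation head alone.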

arXiv information

Authors	Seongjae Kang, Dong Bok Lee, Hyungjoon Jang, Sung Ju Hwang
Published	2025-05-12 15:39:51+00:00
arXiv page	arxiv_id(pdf)

