Distilling Large Vision-Language Model with Out-of-Distribution Generalizability

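Summary

Large vision-language models achieve outstanding performance, but their size and computational cost make them impractical to deploy on resource-constrained devices or in time-sensitive tasks.
Model distillation, which produces smaller, faster models that retain the performance of larger ones, is a promising direction toward a solution.
This paper investigates how to distill the visual representations of a large teacher vision-language model into a lightweight student model using a small- or mid-scale dataset.
In particular, the study focuses on open-vocabulary out-of-distribution (OOD) generalization, a challenging problem that previous model distillation literature has overlooked.
We propose two principles, one from the vision modality and one from the language modality, to improve the student's OOD generalization: (1) better imitating the teacher's visual representation space while carefully promoting coherence in vision-language alignment with the teacher;
(2) enriching the teacher's language representations with informative, fine-grained semantic attributes so that different labels can be distinguished effectively.
We propose several metrics and conduct extensive experiments to investigate these techniques.
The results show significant improvements in the student's zero-shot and few-shot performance on open-vocabulary out-of-distribution classification, demonstrating the effectiveness of the proposed approaches.
Code is released at https://github.com/xuanlinli17/large_vlm_distillation_ood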
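
The two principles above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`imitation_loss`, `alignment_loss`, `enrich_label`, `distillation_loss`), the squared-distance and cross-entropy formulations, the prompt template, the temperature of 0.07, and the loss weight are all assumptions chosen for illustration. Feature vectors are plain Python lists standing in for encoder outputs.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def imitation_loss(student_feat, teacher_feat):
    """Principle (1), first half (assumed form): squared L2 distance
    between unit-normalized student and teacher image features."""
    s, t = normalize(student_feat), normalize(teacher_feat)
    return sum((x - y) ** 2 for x, y in zip(s, t))


def alignment_loss(student_feat, text_feats, label, temperature=0.07):
    """Principle (1), second half (assumed form): cross-entropy over
    cosine similarities between the student's image feature and the
    teacher's text features, one text feature per label."""
    s = normalize(student_feat)
    logits = [dot(s, normalize(t)) / temperature for t in text_feats]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]


def enrich_label(label, attributes):
    """Principle (2) (hypothetical template): build a label description
    with semantic attributes, to be passed to the teacher's text encoder."""
    if not attributes:
        return f"a photo of a {label}"
    return f"a photo of a {label}, which has {', '.join(attributes)}"


def distillation_loss(student_feat, teacher_feat, text_feats, label, weight=1.0):
    """Combine both terms; `weight` is an assumed hyperparameter."""
    return imitation_loss(student_feat, teacher_feat) + weight * alignment_loss(
        student_feat, text_feats, label
    )
```

In this sketch, the text features passed to `alignment_loss` would come from encoding the strings returned by `enrich_label` with the teacher's text encoder, which is omitted here.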

Summary (original)

Large vision-language models have achieved outstanding performance, but their size and computational requirements make their deployment on resource-constrained devices and time-sensitive tasks impractical. Model distillation, the process of creating smaller, faster models that maintain the performance of larger models, is a promising direction towards the solution. This paper investigates the distillation of visual representations in large teacher vision-language models into lightweight student models using a small- or mid-scale dataset. Notably, this study focuses on open-vocabulary out-of-distribution (OOD) generalization, a challenging problem that has been overlooked in previous model distillation literature. We propose two principles from vision and language modality perspectives to enhance student’s OOD generalization: (1) by better imitating teacher’s visual representation space, and carefully promoting better coherence in vision-language alignment with the teacher; (2) by enriching the teacher’s language representations with informative and finegrained semantic attributes to effectively distinguish between different labels. We propose several metrics and conduct extensive experiments to investigate their techniques. The results demonstrate significant improvements in zero-shot and few-shot student performance on open-vocabulary out-of-distribution classification, highlighting the effectiveness of our proposed approaches. Code released at https://github.com/xuanlinli17/large_vlm_distillation_ood

arXiv information

Authors	Xuanlin Li, Yunhao Fang, Minghua Liu, Zhan Ling, Zhuowen Tu, Hao Su
Published	2023-07-19 01:28:30+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
