TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models

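Abstract

Causal language models have demonstrated remarkable capabilities, but their size makes deployment in resource-constrained environments difficult.
Knowledge distillation, a widely used technique for transferring knowledge from a large teacher model to a small student model, is a promising approach to model compression.
A significant remaining issue lies in the major differences between teacher and student models: the substantial capacity gap, mode averaging, and mode collapse, all of which pose barriers during distillation.
To address these issues, the authors introduce $\textit{Temporally Adaptive Interpolated Distillation (TAID)}$, a new knowledge distillation approach that dynamically interpolates between the student and teacher distributions through an adaptive intermediate distribution, gradually shifting from the student's initial distribution towards the teacher's distribution.
They provide a theoretical analysis showing that TAID can prevent mode collapse, and they show empirically that it addresses the capacity gap while balancing mode averaging and mode collapse.
Their comprehensive experiments show that TAID performs better across various model sizes and architectures in both instruction tuning and pre-training scenarios.
They also showcase TAID's practical impact by developing two state-of-the-art compact foundation models: $\texttt{TAID-LLM-1.5B}$ for language tasks and $\texttt{TAID-VLM-2B}$ for vision-language tasks.
These results demonstrate TAID's effectiveness in creating high-performing, efficient models and in advancing the development of more accessible AI technologies.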

Abstract (original)

Causal language models have demonstrated remarkable capabilities, but their size poses significant challenges for deployment in resource-constrained environments. Knowledge distillation, a widely-used technique for transferring knowledge from a large teacher model to a small student model, presents a promising approach for model compression. A significant remaining issue lies in the major differences between teacher and student models, namely the substantial capacity gap, mode averaging, and mode collapse, which pose barriers during distillation. To address these issues, we introduce $\textit{Temporally Adaptive Interpolated Distillation (TAID)}$, a novel knowledge distillation approach that dynamically interpolates student and teacher distributions through an adaptive intermediate distribution, gradually shifting from the student’s initial distribution towards the teacher’s distribution. We provide a theoretical analysis demonstrating TAID’s ability to prevent mode collapse and empirically show its effectiveness in addressing the capacity gap while balancing mode averaging and mode collapse. Our comprehensive experiments demonstrate TAID’s superior performance across various model sizes and architectures in both instruction tuning and pre-training scenarios. Furthermore, we showcase TAID’s practical impact by developing two state-of-the-art compact foundation models: $\texttt{TAID-LLM-1.5B}$ for language tasks and $\texttt{TAID-VLM-2B}$ for vision-language tasks. These results demonstrate TAID’s effectiveness in creating high-performing and efficient models, advancing the development of more accessible AI technologies.
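The interpolation idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the formula, so the code below assumes a simple convex mixture of student and teacher probabilities, and it replaces the adaptive update of the interpolation coefficient with a hypothetical linear schedule. The names `interpolate`, `linear_schedule`, and `t` are illustrative only.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def interpolate(student_probs, teacher_probs, t):
    """Assumed intermediate target: (1 - t) * student + t * teacher, t in [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [(1.0 - t) * s + t * p for s, p in zip(student_probs, teacher_probs)]

def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) for two discrete distributions over the same vocabulary."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(target, pred)
        if p > 0.0
    )

def linear_schedule(step, total_steps, t_start=0.0, t_end=1.0):
    """Hypothetical stand-in for an adaptive schedule: t grows linearly with progress."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start + (t_end - t_start) * frac

# Toy next-token distributions over a three-token vocabulary.
student = softmax([2.0, 0.5, -1.0])
teacher = softmax([0.1, 3.0, 0.2])

for step in (0, 50, 100):
    t = linear_schedule(step, total_steps=100)
    # In a real training loop the student term inside the target would be
    # treated as a constant (no gradient flows through it).
    target = interpolate(student, teacher, t)
    loss = kl_divergence(target, student)
    print(f"step={step:3d} t={t:.2f} loss={loss:.4f}")
```

At `t = 0` the target equals the student's own distribution, so the loss is zero. As `t` approaches 1 the target approaches the teacher and the loss approaches the ordinary distillation loss KL(teacher || student). Under these assumptions, the student faces a target that moves away from it gradually instead of having to close the full capacity gap at once.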

arXiv information

Authors	Makoto Shing, Kou Misaki, Han Bao, Sho Yokoi, Takuya Akiba
Published	2025-01-29 05:51:25+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
