High-dimensional Analysis of Knowledge Distillation: Weak-to-Strong Generalization and Scaling Laws

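Abstract

A growing number of machine learning scenarios rely on knowledge distillation, in which the outputs of a surrogate model serve as labels to supervise the training of a target model.
This work gives a sharp characterization of that process for ridgeless, high-dimensional regression under two settings: (i) model shift, where the surrogate model is arbitrary, and (ii) distribution shift, where the surrogate model is the solution of empirical risk minimization on out-of-distribution data.
In both cases, the precise risk of the target model is characterized through non-asymptotic bounds in terms of sample size and data distribution, under mild conditions.
As a consequence, the authors identify the form of the optimal surrogate model, which reveals the benefits and limitations of discarding weak features in a data-dependent fashion.
In the context of weak-to-strong (W2S) generalization, this means that (i) W2S training, with the surrogate as the weak model, can provably outperform training with strong labels under the same data budget, but (ii) it cannot improve the data scaling law.
The results are validated in numerical experiments on both ridgeless regression and neural network architectures.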

Abstract (original)

A growing number of machine learning scenarios rely on knowledge distillation where one uses the output of a surrogate model as labels to supervise the training of a target model. In this work, we provide a sharp characterization of this process for ridgeless, high-dimensional regression, under two settings: (i) model shift, where the surrogate model is arbitrary, and (ii) distribution shift, where the surrogate model is the solution of empirical risk minimization with out-of-distribution data. In both cases, we characterize the precise risk of the target model through non-asymptotic bounds in terms of sample size and data distribution under mild conditions. As a consequence, we identify the form of the optimal surrogate model, which reveals the benefits and limitations of discarding weak features in a data-dependent fashion. In the context of weak-to-strong (W2S) generalization, this has the interpretation that (i) W2S training, with the surrogate as the weak model, can provably outperform training with strong labels under the same data budget, but (ii) it is unable to improve the data scaling law. We validate our results on numerical experiments both on ridgeless regression and on neural network architectures.
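The setup described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' code: the dimensions, the power-law covariance, the noise level, and the surrogate (the ground-truth vector with all but its `k` strongest features zeroed) are illustrative assumptions, and the surrogate labels are assumed to carry the same additive noise as the strong labels. It fits a minimum-norm (ridgeless) regressor once on strong labels and once on surrogate labels, then reports the population excess risk of each.

```python
import numpy as np

def min_norm_fit(X, y):
    """Ridgeless (minimum-norm) least-squares solution via the pseudoinverse."""
    return np.linalg.pinv(X) @ y

def excess_risk(beta_hat, beta_star, cov_diag):
    """Population excess risk for zero-mean features with diagonal covariance."""
    diff = beta_hat - beta_star
    return float(np.sum(cov_diag * diff ** 2))

rng = np.random.default_rng(0)
p, n, k, noise_std = 500, 200, 50, 0.5   # assumed sizes; p > n (overparameterized)

# Assumed power-law spectrum: feature j has variance 1/j, so later features are "weak".
cov_diag = 1.0 / np.arange(1, p + 1)
beta_star = rng.standard_normal(p) / np.sqrt(p)

# Hypothetical surrogate: keep the k strongest features, discard the rest.
# This is an illustrative choice, not the optimal surrogate derived in the paper.
beta_surrogate = beta_star.copy()
beta_surrogate[k:] = 0.0

X = rng.standard_normal((n, p)) * np.sqrt(cov_diag)
noise = noise_std * rng.standard_normal(n)

y_strong = X @ beta_star + noise          # labels from the ground truth
y_surrogate = X @ beta_surrogate + noise  # labels from the surrogate (assumed noise model)

beta_strong = min_norm_fit(X, y_strong)
beta_w2s = min_norm_fit(X, y_surrogate)

risk_strong = excess_risk(beta_strong, beta_star, cov_diag)
risk_w2s = excess_risk(beta_w2s, beta_star, cov_diag)
print(f"excess risk, strong labels:    {risk_strong:.4f}")
print(f"excess risk, surrogate labels: {risk_w2s:.4f}")
```

Which of the two risks is smaller depends on the spectrum, the noise level, and the cutoff `k`; sweeping `n` and `k` in this sketch is one way to explore the trade-off the abstract describes.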

arXiv information

Authors	M. Emrullah Ildiz, Halil Alperen Gozeten, Ege Onur Taga, Marco Mondelli, Samet Oymak
Published	2025-02-27 18:49:33+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
