Quantifying Knowledge Distillation Using Partial Information Decomposition

Abstract

Knowledge distillation deploys complex machine learning models in resource-constrained environments by training a smaller student model to emulate internal representations of a complex teacher model. However, the teacher’s representations can also encode nuisance or additional information not relevant to the downstream task. Distilling such irrelevant information can actually impede the performance of a capacity-limited student model. This observation motivates our primary question: What are the information-theoretic limits of knowledge distillation? To this end, we leverage Partial Information Decomposition to quantify and explain the transferred knowledge and knowledge left to distill for a downstream task. We theoretically demonstrate that the task-relevant transferred knowledge is succinctly captured by the measure of redundant information about the task between the teacher and student. We propose a novel multi-level optimization to incorporate redundant information as a regularizer, leading to our framework of Redundant Information Distillation (RID). RID leads to more resilient and effective distillation under nuisance teachers as it succinctly quantifies task-relevant knowledge rather than simply aligning student and teacher representations.
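
For readers unfamiliar with Partial Information Decomposition, the standard bivariate identities are sketched below. The symbols are assumptions chosen for illustration and do not come from the abstract: $Y$ is the task label, $T$ the teacher representation, and $S$ the student representation. The abstract identifies the redundant term with the task-relevant transferred knowledge; reading the teacher's unique term as the knowledge left to distill is one plausible interpretation that the abstract itself does not confirm.

```latex
% Standard bivariate PID identities (notation assumed, not taken from the abstract).
% Red = redundant, Uni = unique, Syn = synergistic information about Y.
\begin{align}
I(Y; T, S) &= \mathrm{Red}(Y{:}\,T,S) + \mathrm{Uni}(Y{:}\,T \setminus S)
            + \mathrm{Uni}(Y{:}\,S \setminus T) + \mathrm{Syn}(Y{:}\,T,S), \\
I(Y; T)    &= \mathrm{Red}(Y{:}\,T,S) + \mathrm{Uni}(Y{:}\,T \setminus S), \\
I(Y; S)    &= \mathrm{Red}(Y{:}\,T,S) + \mathrm{Uni}(Y{:}\,S \setminus T).
\end{align}
```

Under these identities, the redundant term can be at most $\min\{I(Y;T),\, I(Y;S)\}$, so aligning the student with teacher features that carry no information about $Y$ cannot increase it.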

arXiv information

Authors	Pasan Dissanayake, Faisal Hamman, Barproda Halder, Ilia Sucholutsky, Qiuyi Zhang, Sanghamitra Dutta
Published	2025-04-04 16:08:36+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
