GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models

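Abstract

Knowledge distillation is commonly used to compress neural networks, reducing their inference cost and memory footprint.
However, current distillation methods for auto-regressive models, such as generative language models (LMs), have two key problems: (1) a distribution mismatch between the output sequences seen during training and the sequences the student generates at deployment, and (2) model under-specification, where the student model may not be expressive enough to fit the teacher's distribution.
To address these problems, we propose Generalized Knowledge Distillation (GKD).
GKD mitigates the distribution mismatch by sampling output sequences from the student during training.
In addition, GKD handles model under-specification by optimizing alternative divergences, such as reverse KL, that focus on making the student generate samples that are likely under the teacher's distribution.
We show that GKD outperforms commonly used approaches for distilling LLMs on summarization, machine translation, and arithmetic reasoning tasks.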
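
The two ideas in the abstract, sampling sequences from the student and scoring them with a divergence such as reverse KL, can be illustrated with a toy sketch. This is not the authors' implementation: the three-token vocabulary, the functions `teacher_probs` and `student_probs`, and the helper `on_policy_loss` are hypothetical names made up for illustration, and a real training loop would also backpropagate the loss through the student's parameters, which is omitted here.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def forward_kl(teacher, student):
    """KL(teacher || student): penalizes the student for missing teacher modes."""
    return kl(teacher, student)


def reverse_kl(teacher, student):
    """KL(student || teacher): penalizes student mass where the teacher has little."""
    return kl(student, teacher)


def teacher_probs(prefix):
    """Hypothetical toy teacher: two equally likely modes, nudged by the last token."""
    logits = [2.0, -1.0, 2.0]
    if prefix:
        logits[prefix[-1]] += 0.5
    return softmax(logits)


def student_probs(prefix):
    """Hypothetical toy student that concentrates on a single mode."""
    return softmax([2.5, -1.0, 0.0])


def sample_sequence(next_token_probs, length, rng):
    """Sample a token sequence auto-regressively from a prefix -> probs function."""
    seq = []
    for _ in range(length):
        probs = next_token_probs(seq)
        seq.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return seq


def on_policy_loss(teacher, student, length, rng, divergence=reverse_kl):
    """Average per-token divergence along a sequence sampled from the student."""
    seq = sample_sequence(student, length, rng)
    total = 0.0
    for t in range(length):
        prefix = seq[:t]
        total += divergence(teacher(prefix), student(prefix))
    return total / length


if __name__ == "__main__":
    rng = random.Random(0)
    print("forward KL at empty prefix:", forward_kl(teacher_probs([]), student_probs([])))
    print("reverse KL at empty prefix:", reverse_kl(teacher_probs([]), student_probs([])))
    print("on-policy reverse-KL loss:", on_policy_loss(teacher_probs, student_probs, 8, rng))
```

In this toy setup the single-mode student is penalized less by reverse KL than by forward KL, which matches the abstract's point that reverse KL rewards a limited student for producing samples the teacher considers likely rather than for covering the whole teacher distribution.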

Abstract (original)

Knowledge distillation is commonly used for compressing neural networks to reduce their inference cost and memory footprint. However, current distillation methods for auto-regressive models, such as generative language models (LMs), suffer from two key issues: (1) distribution mismatch between output sequences during training and the sequences generated by the student during its deployment, and (2) model under-specification, where the student model may not be expressive enough to fit the teacher’s distribution. To address these issues, we propose Generalized Knowledge Distillation (GKD). GKD mitigates distribution mismatch by sampling output sequences from the student during training. Furthermore, GKD handles model under-specification by optimizing alternative divergences, such as reverse KL, that focus on generating samples from the student that are likely under the teacher’s distribution. We demonstrate that GKD outperforms commonly-used approaches for distilling LLMs on summarization, machine translation, and arithmetic reasoning tasks.

arXiv information

Authors	Rishabh Agarwal, Nino Vieillard, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem
Published	2023-06-23 17:56:26+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
