Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference

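Summary

Deploying pre-trained transformer models such as BERT on downstream tasks in resource-constrained settings is difficult because inference cost is high and grows rapidly with input sequence length.
This work proposes ToP, a constraint-aware, ranking-distilled token pruning method that selectively removes unnecessary tokens as the input sequence passes through the layers, improving online inference speed while preserving accuracy.
ToP addresses the inaccurate token importance rankings produced by the conventional self-attention mechanism with a ranking-distilled token distillation technique, which distills effective token rankings from the final layer of the unpruned model into the early layers of the pruned model.
ToP then applies a coarse-to-fine pruning approach that automatically selects the optimal subset of transformer layers and optimizes the token pruning decisions within those layers through improved $L_0$ regularization.
Extensive experiments on the GLUE benchmark and SQuAD tasks show that ToP outperforms state-of-the-art token pruning and model compression methods in both accuracy and speedup.
On GLUE, ToP reduces the average FLOPs of BERT by 8.1x while achieving competitive accuracy, and it delivers a real latency speedup of up to 7.4x on an Intel CPU.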

Abstract (original)

Deploying pre-trained transformer models like BERT on downstream tasks in resource-constrained scenarios is challenging due to their high inference cost, which grows rapidly with input sequence length. In this work, we propose a constraint-aware and ranking-distilled token pruning method ToP, which selectively removes unnecessary tokens as input sequence passes through layers, allowing the model to improve online inference speed while preserving accuracy. ToP overcomes the limitation of inaccurate token importance ranking in the conventional self-attention mechanism through a ranking-distilled token distillation technique, which distills effective token rankings from the final layer of unpruned models to early layers of pruned models. Then, ToP introduces a coarse-to-fine pruning approach that automatically selects the optimal subset of transformer layers and optimizes token pruning decisions within these layers through improved $L_0$ regularization. Extensive experiments on GLUE benchmark and SQuAD tasks demonstrate that ToP outperforms state-of-the-art token pruning and model compression methods with improved accuracy and speedups. ToP reduces the average FLOPs of BERT by 8.1x while achieving competitive accuracy on GLUE, and provides a real latency speedup of up to 7.4x on an Intel CPU.
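To make the idea of attention-based token pruning concrete, here is a minimal sketch of the conventional baseline the abstract refers to, not ToP itself. It assumes a token's importance is the average attention it receives across heads and query positions, and that the first token (for example `[CLS]`) is always kept. The function names `token_importance` and `prune_tokens`, the `keep_ratio` parameter, and the `always_keep` default are hypothetical and chosen only for illustration; the abstract does not specify any of them.

```python
import math
from typing import List, Sequence


def token_importance(attention: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Score each token by the mean attention it receives.

    attention[h][i][j] is the weight from query token i to key token j in
    head h. This scoring rule is an assumption for illustration; it is the
    kind of self-attention ranking the abstract describes as inaccurate.
    """
    num_heads = len(attention)
    seq_len = len(attention[0])
    scores = [0.0] * seq_len
    for head in attention:
        for row in head:
            for j, weight in enumerate(row):
                scores[j] += weight
    return [s / (num_heads * seq_len) for s in scores]


def prune_tokens(
    tokens: Sequence[str],
    scores: Sequence[float],
    keep_ratio: float,
    always_keep: Sequence[int] = (0,),
) -> List[str]:
    """Keep the highest-scoring tokens, preserving their original order.

    keep_ratio is a hypothetical per-layer budget; positions listed in
    always_keep (assumed here to be the [CLS] position) are never pruned.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    protected = set(always_keep)
    n_keep = max(len(protected), math.ceil(len(tokens) * keep_ratio))
    # Rank the unprotected positions from most to least important.
    ranked = sorted(
        (i for i in range(len(tokens)) if i not in protected),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept = sorted(protected | set(ranked[: n_keep - len(protected)]))
    return [tokens[i] for i in kept]


if __name__ == "__main__":
    tokens = ["[CLS]", "the", "movie", "was", "great", "[SEP]"]
    scores = [0.05, 0.02, 0.30, 0.03, 0.40, 0.20]
    print(prune_tokens(tokens, scores, keep_ratio=0.5))
```

In a layer-wise pruning scheme, a step like `prune_tokens` would run between transformer layers so that later layers process a shorter sequence. According to the abstract, ToP differs from this baseline in two ways: the early-layer rankings are distilled from the final layer of the unpruned model, and the choice of layers and per-layer pruning decisions is learned through improved $L_0$ regularization rather than fixed by hand.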

arXiv information

Authors	Junyan Li, Li Lyna Zhang, Jiahang Xu, Yujing Wang, Shaoguang Yan, Yunqing Xia, Yuqing Yang, Ting Cao, Hao Sun, Weiwei Deng, Qi Zhang, Mao Yang
Published	2023-06-26 03:06:57+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
