Jump to Conclusions: Short-Cutting Transformers With Linear Transformations

Summary

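Transformer-based language models (LMs) create hidden representations of their inputs at every layer, but use only the final-layer representations for prediction.
This obscures the model's internal decision-making process and the usefulness of its intermediate representations.
One way to shed light on this is to cast hidden representations as final representations, bypassing the transformer computation in between.
In this work, we propose a simple method for such casting, using linear transformations.
We show that our approach produces more accurate approximations than the prevailing practice of inspecting hidden representations from all layers in the space of the final layer.
Moreover, in the context of language modeling, the method makes it possible to "peek" into early-layer representations of GPT-2 and BERT, showing that LMs often already predict the final output in early layers.
We then demonstrate the practicality of our method for recent early-exit strategies, showing that when aiming, for example, to retain 95% accuracy, our approach saves an additional 7.9% of layers for GPT-2 and 5.4% of layers for BERT, on top of the savings of the original approach.
Finally, we extend the method to linearly approximate sub-modules, and find that attention is the most tolerant to this change.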

Summary (original)

Transformer-based language models (LMs) create hidden representations of their inputs at every layer, but only use final-layer representations for prediction. This obscures the internal decision-making process of the model and the utility of its intermediate representations. One way to elucidate this is to cast the hidden representations as final representations, bypassing the transformer computation in-between. In this work, we suggest a simple method for such casting, by using linear transformations. We show that our approach produces more accurate approximations than the prevailing practice of inspecting hidden representations from all layers in the space of the final layer. Moreover, in the context of language modeling, our method allows ‘peeking’ into early layer representations of GPT-2 and BERT, showing that often LMs already predict the final output in early layers. We then demonstrate the practicality of our method to recent early exit strategies, showing that when aiming, for example, at retention of 95% accuracy, our approach saves additional 7.9% layers for GPT-2 and 5.4% layers for BERT, on top of the savings of the original approach. Last, we extend our method to linearly approximate sub-modules, finding that attention is most tolerant to this change.
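
To make the idea of casting with a linear transformation concrete, here is a minimal sketch, not the authors' implementation. It fits a matrix by ordinary least squares so that early-layer hidden states map onto final-layer hidden states, and compares that with reading the early-layer states directly in the final-layer space (the identity baseline). The hidden states are synthetic random data, and the names `fit_linear_shortcut`, `mse`, `fake_final_layer`, and `w_true` are hypothetical; nothing here reproduces GPT-2 or BERT results.

```python
import numpy as np


def fit_linear_shortcut(h_early, h_final):
    """Return A minimizing ||h_early @ A - h_final||^2 (ordinary least squares)."""
    A, *_ = np.linalg.lstsq(h_early, h_final, rcond=None)
    return A


def mse(pred, target):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((pred - target) ** 2))


rng = np.random.default_rng(0)
d, n_train, n_test = 16, 500, 200

# Assumption: synthetic stand-ins for hidden states, not real model activations.
w_true = rng.normal(size=(d, d)) / np.sqrt(d)


def fake_final_layer(h):
    """Hypothetical 'rest of the network': mostly linear, mildly nonlinear, noisy."""
    z = h @ w_true
    return z + 0.1 * np.tanh(z) + 0.05 * rng.normal(size=h.shape)


h_train = rng.normal(size=(n_train, d))
h_test = rng.normal(size=(n_test, d))
y_train = fake_final_layer(h_train)
y_test = fake_final_layer(h_test)

# Fit the shortcut on training pairs, then evaluate on held-out pairs.
A = fit_linear_shortcut(h_train, y_train)

mse_linear = mse(h_test @ A, y_test)   # learned linear shortcut
mse_identity = mse(h_test, y_test)     # early states read as-is in final space

print(f"linear shortcut MSE: {mse_linear:.4f}")
print(f"identity baseline MSE: {mse_identity:.4f}")
```

On a real model, `h_train` and `y_train` would instead be hidden states collected at an intermediate layer and at the final layer over the same set of inputs.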

arXiv information

Authors	Alexander Yom Din, Taelin Karidi, Leshem Choshen, Mor Geva
Published	2023-03-16 16:10:16+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
