How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression


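In this study, we consider a sparse linear regression problem and investigate how a trained multi-head transformer performs in-context learning.
We find experimentally that multi-head utilization exhibits different patterns across layers: multiple heads are used and essential in the first layer, whereas a single head is usually sufficient in subsequent layers.
We give a theoretical explanation for this observation: the first layer preprocesses the context data, and subsequent layers perform simple optimization steps on the preprocessed context.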


Despite the remarkable success of transformer-based models in various real-world tasks, their underlying mechanisms remain poorly understood. Recent studies have suggested that transformers can implement gradient descent as an in-context learner for linear regression problems and have developed various theoretical analyses accordingly. However, these works mostly focus on the expressive power of transformers by designing specific parameter constructions, lacking a comprehensive understanding of their inherent working mechanisms post-training. In this study, we consider a sparse linear regression problem and investigate how a trained multi-head transformer performs in-context learning. We experimentally discover that the utilization of multi-heads exhibits different patterns across layers: multiple heads are utilized and essential in the first layer, while usually only a single head is sufficient for subsequent layers. We provide a theoretical explanation for this observation: the first layer preprocesses the context data, and the following layers execute simple optimization steps based on the preprocessed context. Moreover, we demonstrate that such a preprocess-then-optimize algorithm can significantly outperform naive gradient descent and ridge regression algorithms. Further experimental results support our explanations. Our findings offer insights into the benefits of multi-head attention and contribute to understanding the more intricate mechanisms hidden within trained transformers.
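The setting in the abstract can be made concrete with a small NumPy sketch: a sparse linear regression prompt, the two baselines the abstract names (gradient descent and ridge regression), and a preprocess-then-optimize variant. Everything specific here is an assumption made for illustration: the sizes `n`, `d`, `s`, the step-size rule, the value of `lam`, and especially the rescaling rule in `preprocess_then_gd`, which is a hypothetical stand-in and not the preprocessing derived in the paper.

```python
import numpy as np


def make_sparse_task(n=12, d=16, s=3, noise=0.0, seed=0):
    """Sample one in-context task: y = X @ w + noise, with an s-sparse w.

    The sizes and Gaussian design are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)
    w[support] = rng.normal(size=s)
    X = rng.normal(size=(n, d))
    y = X @ w + noise * rng.normal(size=n)
    return X, y, w


def gradient_descent(X, y, steps=50):
    """Plain gradient descent on the loss (1 / 2n) * ||X w - y||^2 from w = 0."""
    n, d = X.shape
    # Step size 1/L, where L = sigma_max(X)^2 / n is the gradient's Lipschitz constant.
    lr = n / np.linalg.norm(X, 2) ** 2
    w = np.zeros(d)
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / n
    return w


def ridge(X, y, lam=0.1):
    """Closed-form ridge regression: (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def preprocess_then_gd(X, y, steps=50, eps=1e-12):
    """Hypothetical preprocess-then-optimize sketch (NOT the paper's construction).

    Preprocess: rescale each feature by the normalized magnitude of its
    empirical correlation with the labels, so that coordinates that look
    irrelevant are shrunk. Optimize: run plain gradient descent on the
    rescaled features, then map the solution back to the original scale.
    """
    n = X.shape[0]
    r = np.abs(X.T @ y) / n
    r = r / (r.max() + eps)
    v = gradient_descent(X * r, y, steps=steps)
    return r * v


if __name__ == "__main__":
    X, y, w_true = make_sparse_task()
    for name, w_hat in [
        ("gradient descent", gradient_descent(X, y)),
        ("ridge", ridge(X, y)),
        ("preprocess + GD (hypothetical)", preprocess_then_gd(X, y)),
    ]:
        print(f"{name}: parameter error = {np.linalg.norm(w_hat - w_true):.3f}")
```

Running the script prints the parameter error of each estimator on one sampled task. A single task says nothing about which method is better in general; reproducing the comparison claimed in the abstract would require averaging over many tasks and using the preprocessing actually derived in the paper.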


Authors: Xingwu Chen, Lei Zhao, Difan Zou
Published: 2024-08-08 15:33:02+00:00


Category: cs.LG