Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains

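Summary

In recent years, attention-based transformers have achieved great success across a wide range of fields, including natural language.
A key ingredient behind this success is the generative pretraining procedure, in which these models are trained autoregressively on a large text corpus.
To shed light on this phenomenon, we propose a new framework for studying the sequential modeling capabilities of transformers through the lens of Markov chains, using both theory and systematic experiments.
Inspired by the Markovianity of natural language, we model the data as a Markovian source and use this framework to systematically study the interplay between the distributional properties of the data, the transformer architecture, the learned distribution, and the final model performance.
In particular, we theoretically characterize the loss landscape of single-layer transformers and show that global minima and bad local minima exist, depending on the specific data characteristics and the transformer architecture.
We back this up with experiments and demonstrate that the theoretical findings agree with the empirical results.
We further investigate these findings in the broader context of higher-order Markov chains and deeper architectures, and outline open problems in this area.
Code is available at \url{https://github.com/Bond1995/Markov}.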
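
To make the idea of a "Markovian source" concrete, the following is a minimal sketch of a binary first-order Markov chain together with two standard reference values for next-token cross-entropy loss: the entropy rate (the best any predictor can achieve) and the entropy of the stationary distribution (the loss of a predictor that ignores the previous token). The binary alphabet, the switching probabilities `p` and `q`, and all function names are assumptions chosen for illustration; the summary does not specify them, and this is not the code from the linked repository.

```python
import math
import random


def sample_binary_markov(p, q, length, seed=0):
    """Sample a binary first-order Markov chain (illustrative assumption).

    p = P(next = 1 | current = 0), q = P(next = 0 | current = 1).
    The first symbol is drawn from the stationary distribution.
    """
    rng = random.Random(seed)
    pi1 = p / (p + q)  # stationary probability of state 1
    state = 1 if rng.random() < pi1 else 0
    seq = [state]
    for _ in range(length - 1):
        switch_prob = p if state == 0 else q
        if rng.random() < switch_prob:
            state = 1 - state
        seq.append(state)
    return seq


def binary_entropy(x):
    """Entropy in nats of a Bernoulli(x) variable."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1 - x) * math.log(1 - x)


def entropy_rate(p, q):
    """Lowest achievable next-token cross-entropy for this chain."""
    pi0, pi1 = q / (p + q), p / (p + q)
    return pi0 * binary_entropy(p) + pi1 * binary_entropy(q)


def stationary_entropy(p, q):
    """Loss of a predictor that always outputs the stationary distribution."""
    return binary_entropy(p / (p + q))


def empirical_loss(seq, predict_one):
    """Average next-token cross-entropy; predict_one(prev) = P(next = 1 | prev)."""
    total = 0.0
    for prev, nxt in zip(seq, seq[1:]):
        prob_one = predict_one(prev)
        total -= math.log(prob_one if nxt == 1 else 1.0 - prob_one)
    return total / (len(seq) - 1)


p, q = 0.2, 0.3  # hypothetical switching probabilities
seq = sample_binary_markov(p, q, 50_000)
context_loss = empirical_loss(seq, lambda prev: p if prev == 0 else 1 - q)
no_context_loss = empirical_loss(seq, lambda prev: p / (p + q))
print(context_loss, entropy_rate(p, q))
print(no_context_loss, stationary_entropy(p, q))
```

The gap between the two printed losses is what a model gains by actually using the previous token. Whether a trained single-layer transformer reaches the context-aware value or stalls elsewhere is the loss-landscape question the summary describes; this sketch only provides the reference points, not that analysis.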

Abstract (original)

In recent years, attention-based transformers have achieved tremendous success across a variety of disciplines including natural languages. A key ingredient behind their success is the generative pretraining procedure, during which these models are trained on a large text corpus in an auto-regressive manner. To shed light on this phenomenon, we propose a new framework that allows both theory and systematic experiments to study the sequential modeling capabilities of transformers through the lens of Markov chains. Inspired by the Markovianity of natural languages, we model the data as a Markovian source and utilize this framework to systematically study the interplay between the data-distributional properties, the transformer architecture, the learnt distribution, and the final model performance. In particular, we theoretically characterize the loss landscape of single-layer transformers and show the existence of global minima and bad local minima contingent upon the specific data characteristics and the transformer architecture. Backed by experiments, we demonstrate that our theoretical findings are in congruence with the empirical results. We further investigate these findings in the broader context of higher order Markov chains and deeper architectures, and outline open problems in this arena. Code is available at \url{https://github.com/Bond1995/Markov}.

arXiv information

Authors	Ashok Vardhan Makkuva, Marco Bondaschi, Adway Girish, Alliot Nagle, Martin Jaggi, Hyeji Kim, Michael Gastpar
Published	2024-02-06 17:18:59+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
