How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias

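Abstract

Language recognition tasks are fundamental to natural language processing (NLP) and are widely used to benchmark large language models (LLMs). They also play an important role in explaining how transformers work. This study focuses on two representative regular language recognition tasks, "even pairs" and "parity check". Our goal is to explore how a one-layer transformer, consisting of an attention layer and a linear layer, learns to solve these tasks, by theoretically analyzing its training dynamics under gradient descent. Even pairs can be solved directly by a one-layer transformer, whereas parity check must be solved by integrating Chain-of-Thought (CoT), either into the inference stage of a transformer well trained on the even pairs task or into the training of a one-layer transformer. For both problems, our analysis shows that joint training of the attention and linear layers exhibits two distinct phases. In the first phase, the attention layer grows rapidly and maps data sequences to separable vectors. In the second phase, the attention layer stabilizes while the linear layer grows logarithmically, approaching in direction the max-margin hyperplane that correctly separates the attention layer outputs into positive and negative samples, and the loss decreases at a rate of $O(1/t)$. Our experiments validate these theoretical results.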

Abstract (original)

Language recognition tasks are fundamental in natural language processing (NLP) and have been widely used to benchmark the performance of large language models (LLMs). These tasks also play a crucial role in explaining the working mechanisms of transformers. In this work, we focus on two representative tasks in the category of regular language recognition, known as "even pairs" and "parity check", the aim of which is to determine whether the number of occurrences of certain subsequences in a given sequence is even. Our goal is to explore how a one-layer transformer, consisting of an attention layer followed by a linear layer, learns to solve these tasks by theoretically analyzing its training dynamics under gradient descent. While even pairs can be solved directly by a one-layer transformer, parity check needs to be solved by integrating Chain-of-Thought (CoT), either into the inference stage of a transformer well trained for the even pairs task, or into the training of a one-layer transformer. For both problems, our analysis shows that the joint training of attention and linear layers exhibits two distinct phases. In the first phase, the attention layer grows rapidly, mapping data sequences into separable vectors. In the second phase, the attention layer becomes stable, while the linear layer grows logarithmically and approaches in direction a max-margin hyperplane that correctly separates the attention layer outputs into positive and negative samples, and the loss decreases at a rate of $O(1/t)$. Our experiments validate these theoretical results.
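The abstract does not spell out the two languages, so the following is only a minimal sketch under stated assumptions: sequences are strings over the alphabet {a, b}, "even pairs" asks whether the total number of adjacent "ab" and "ba" substrings is even, and "parity check" asks whether the number of "b" symbols is even. The function names are hypothetical and are not taken from the paper.

```python
def count_adjacent(seq: str, pair: str) -> int:
    """Count positions i where seq[i:i+2] equals the two-symbol pair."""
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == pair)


def even_pairs(seq: str) -> bool:
    """Assumed definition: the combined count of 'ab' and 'ba' is even."""
    return (count_adjacent(seq, "ab") + count_adjacent(seq, "ba")) % 2 == 0


def even_pairs_shortcut(seq: str) -> bool:
    """Equivalent test for non-empty strings over {a, b}.

    Each 'ab' or 'ba' is a switch between symbols, and an even number of
    switches means the sequence ends on the symbol it started with.
    """
    return seq[0] == seq[-1]


def parity_check(seq: str) -> bool:
    """Assumed definition: the number of 'b' symbols is even."""
    return seq.count("b") % 2 == 0


if __name__ == "__main__":
    for s in ["abba", "ab", "abb", "bbbb"]:
        print(s, even_pairs(s), even_pairs_shortcut(s), parity_check(s))
```

Under these assumed definitions, even pairs depends only on the first and last symbols, whereas parity check depends on every symbol in the sequence. That difference is one informal way to read the abstract's statement that even pairs can be solved directly while parity check needs CoT.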

arXiv information

Authors	Ruiquan Huang, Yingbin Liang, Jing Yang
Publication date	2025-05-02 00:07:35+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
