Learning to (Learn at Test Time): RNNs with Expressive Hidden States

Abstract

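Self-attention performs well on long contexts, but its complexity is quadratic. Existing RNN layers have linear complexity, but their long-context performance is limited by the expressive power of their hidden state. We propose a new class of sequence modeling layers that combine linear complexity with an expressive hidden state. The key idea is to make the hidden state itself a machine learning model and to make the update rule a step of self-supervised learning. Because the hidden state is updated by training even on test sequences, we call our layers Test-Time Training (TTT) layers. We consider two instantiations, TTT-Linear and TTT-MLP, whose hidden states are a linear model and a two-layer MLP, respectively. We evaluate them at scales from 125M to 1.3B parameters against a strong Transformer and Mamba, a modern RNN. Both TTT-Linear and TTT-MLP match or exceed the baselines. Like the Transformer, they keep reducing perplexity as they condition on more tokens, whereas Mamba cannot do so beyond 16k context. With preliminary systems optimization, TTT-Linear is already faster than the Transformer at 8k context and matches Mamba in wall-clock time. TTT-MLP still faces challenges in memory I/O but shows greater potential in long context, pointing to a promising direction for future research.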

Abstract (original)

Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their hidden state. We propose a new class of sequence modeling layers with linear complexity and an expressive hidden state. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training (TTT) layers. We consider two instantiations: TTT-Linear and TTT-MLP, whose hidden state is a linear model and a two-layer MLP respectively. We evaluate our instantiations at the scale of 125M to 1.3B parameters, comparing with a strong Transformer and Mamba, a modern RNN. Both TTT-Linear and TTT-MLP match or exceed the baselines. Similar to Transformer, they can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context. With preliminary systems optimization, TTT-Linear is already faster than Transformer at 8k context and matches Mamba in wall-clock time. TTT-MLP still faces challenges in memory I/O, but shows larger potential in long context, pointing to a promising direction for future research.
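The key idea in the abstract, a hidden state that is itself a model and an update rule that is one step of self-supervised learning, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `ttt_linear_sketch`, the reconstruction loss, the corruption that zeroes the second half of each token's features, and the learning rate are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import numpy as np

def ttt_linear_sketch(tokens, lr=0.1):
    """Hypothetical sketch of a TTT-style layer with a linear hidden state.

    Assumptions (not from the abstract): the self-supervised task is to
    reconstruct each token from a corrupted view, and the corruption
    zeroes the second half of the features.
    """
    n, d = tokens.shape
    mask = np.zeros(d)
    mask[: d // 2] = 1.0              # assumed corruption: keep first half only
    W = np.zeros((d, d))              # hidden state = weights of a linear model
    outputs = np.empty((n, d))
    losses = np.empty(n)
    for t in range(n):
        x = tokens[t]
        x_corrupt = mask * x
        err = W @ x_corrupt - x       # reconstruction error before the update
        losses[t] = float(err @ err)
        grad = 2.0 * np.outer(err, x_corrupt)  # d/dW of ||W x_corrupt - x||^2
        W = W - lr * grad             # update rule: one gradient step per token
        outputs[t] = W @ x            # output computed with the updated state
    return outputs, losses, W

# Usage: the state W has a fixed size (d x d) however long the sequence is,
# so the cost per token is constant and the total cost is linear in length.
tokens = np.tile(np.ones(8) / np.sqrt(8), (20, 1))
outputs, losses, W = ttt_linear_sketch(tokens)
print(losses[0], losses[-1])
```

Under these assumptions, the loss on a repeated token shrinks as the sequence goes on, which is the sense in which the hidden state "learns" from the test sequence. Swapping the linear map for a two-layer MLP would give the analogue of the TTT-MLP variant named in the abstract.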

arXiv information

Authors	Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin
Published	2024-07-05 16:23:20+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
