Modular Duality in Deep Learning

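Summary

An old idea in optimization theory holds that the gradient is a dual vector, so it cannot be subtracted from the weights without first being mapped to the primal space where the weights live.
This paper takes that idea seriously and constructs such a duality map for general neural networks.
Our map, which we call modular dualization, forms a unifying theoretical basis for training algorithms that are a) fast and b) scalable.
Modular dualization first assigns an operator norm to each layer based on that layer's semantics, then uses these layerwise norms to recursively induce a duality map on the weight space of the full neural architecture.
Finally, we derive GPU-friendly algorithms for dualizing Embed, Linear, and Conv2D layers. The latter two methods are based on a rectangular Newton-Schulz iteration (Kovarik, 1970; Björck & Bowie, 1971).
A variant of our methods was used to set speed records for training NanoGPT.
Overall, we hope that the theory of modular duality will yield a next generation of fast and scalable optimizers for general neural architectures.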
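
To make the phrase "duality map" concrete, the following is a sketch of the standard definition for a single norm, written in notation assumed here for illustration (the symbols g, t, w, the step size \eta, and the operator name \operatorname{dualize} are not taken from the summary). The last line is the well-known special case of a matrix gradient under the spectral norm, used as an illustrative choice of norm.

```latex
% Duality map induced by a norm \|\cdot\| on weight space:
% pick the unit-norm direction that best aligns with the gradient g.
\operatorname{dualize}_{\|\cdot\|}(g) \;=\; \arg\max_{t \,:\, \|t\| = 1} \langle g, t \rangle

% Descent step using the dualized gradient instead of the raw gradient:
w \;\leftarrow\; w - \eta \, \operatorname{dualize}_{\|\cdot\|}(g)

% Special case: matrix gradient G with reduced SVD G = U \Sigma V^\top
% (full rank), measured in the spectral norm. A maximizer is
\operatorname{dualize}_{\|\cdot\|_{\mathrm{op}}}(G) \;=\; U V^\top
```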

Summary (original)

An old idea in optimization theory says that since the gradient is a dual vector it may not be subtracted from the weights without first being mapped to the primal space where the weights reside. We take this idea seriously in this paper and construct such a duality map for general neural networks. Our map, which we call modular dualization, forms a unifying theoretical basis for training algorithms that are a) fast and b) scalable. Modular dualization involves first assigning operator norms to layers based on the semantics of each layer, and then using these layerwise norms to recursively induce a duality map on the weight space of the full neural architecture. We conclude by deriving GPU-friendly algorithms for dualizing Embed, Linear and Conv2D layers; the latter two methods are based on a rectangular Newton-Schulz iteration (Kovarik, 1970; Björck & Bowie, 1971). A variant of our methods was used to set speed records for training NanoGPT. Overall, we hope that our theory of modular duality will yield a next generation of fast and scalable optimizers for general neural architectures.
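
The abstract mentions a rectangular Newton-Schulz iteration without giving details. As a minimal sketch of the general idea, the NumPy code below applies the classical cubic Newton-Schulz update to a rectangular matrix, driving all of its singular values toward 1 using only matrix multiplications. The function name `newton_schulz_orthogonalize`, the step count, and the choice of the classical cubic coefficients (1.5 and 0.5) are assumptions made for illustration; the authors' GPU-friendly variant may use different coefficients and scaling.

```python
import numpy as np


def newton_schulz_orthogonalize(G, steps=30, eps=1e-12):
    """Approximate U @ V.T for G = U @ diag(s) @ V.T (reduced SVD).

    Hypothetical helper: uses the classical cubic Newton-Schulz update
    X <- 1.5 * X - 0.5 * X @ X.T @ X, which is an assumption here and
    not necessarily the tuned variant used by the authors.
    """
    G = np.asarray(G, dtype=np.float64)
    # Divide by the Frobenius norm so every singular value is at most 1,
    # which places the iterate inside the region where the update converges.
    X = G / (np.linalg.norm(G) + eps)
    tall = X.shape[0] > X.shape[1]
    for _ in range(steps):
        # Form the smaller of the two Gram matrices to save work.
        if tall:
            X = 1.5 * X - 0.5 * X @ (X.T @ X)
        else:
            X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X


# Usage: orthogonalize a random tall "gradient" matrix.
rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))
X = newton_schulz_orthogonalize(G)
# If the iteration has converged, X.T @ X is close to the 4x4 identity.
print(np.round(X.T @ X, 3))
```

The iteration needs no SVD or matrix inverse, which is what makes this family of methods attractive on GPUs. Convergence slows when the matrix has singular values that are tiny relative to its Frobenius norm, so the fixed step count above is only a reasonable default for well-conditioned inputs.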

arXiv information

Authors	Jeremy Bernstein, Laker Newhouse
Published	2024-12-06 17:02:28+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
