MLPs at the EOC: Concentration of the NTK

Summary

We study how the Neural Tangent Kernel (NTK) $K_\theta : \mathbb{R}^{m_0} \times \mathbb{R}^{m_0} \to \mathbb{R}^{m_l \times m_l}$ of an $l$-layer multilayer perceptron (MLP) $N : \mathbb{R}^{m_0} \times \Theta \to \mathbb{R}^{m_l}$ concentrates when the activation function is $\phi(s) = a s + b \vert s \vert$ for some $a,b \in \mathbb{R}$ and the parameter $\theta \in \Theta$ is initialized at the Edge Of Chaos (EOC).
The gradient independence assumption has only been shown to hold asymptotically in the infinitely wide limit; without relying on it, we prove that an approximate version of gradient independence holds at finite width.
We show via maximal inequalities that the NTK entries $K_\theta(x_{i_1},x_{i_2})$ for $i_1,i_2 \in [1:n]$ over a dataset $\{x_1,\cdots,x_n\} \subset \mathbb{R}^{m_0}$ concentrate simultaneously.
From this we prove that the NTK matrix $K(\theta) = [\frac{1}{n} K_\theta(x_{i_1},x_{i_2}) : i_1,i_2 \in [1:n]] \in \mathbb{R}^{nm_l \times nm_l}$ concentrates around its infinitely wide limit $\overset{\scriptscriptstyle\infty}{K} \in \mathbb{R}^{nm_l \times nm_l}$ without the need for linear overparameterization.
Our results imply that, to approximate the limit accurately, the hidden layer widths have to grow quadratically, as $m_k = k^2 m$ for some $m \in \mathbb{N}+1$, for sufficient concentration.
For such MLPs, we obtain the concentration bound $\mathbb{P}( \Vert K(\theta) - \overset{\scriptscriptstyle\infty}{K} \Vert \leq O((\Delta_\phi^{-2} + m_l^{\frac{1}{2}} l) \kappa_\phi^2 m^{-\frac{1}{2}})) \geq 1-O(m^{-1})$ modulo logarithmic terms, where $\Delta_\phi = \frac{b^2}{a^2+b^2}$ and $\kappa_\phi = \frac{\vert a \vert + \vert b \vert}{\sqrt{a^2 + b^2}}$.
In particular, this reveals that the absolute value ($\Delta_\phi=1$, $\kappa_\phi=1$) beats the ReLU ($\Delta_\phi=\frac{1}{2}$, $\kappa_\phi=\sqrt{2}$) in terms of the concentration of the NTK.
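
To make the comparison between the absolute value and the ReLU concrete, the following minimal sketch evaluates $\Delta_\phi$ and $\kappa_\phi$ from the stated formulas, together with the factor $(\Delta_\phi^{-2} + m_l^{\frac{1}{2}} l) \kappa_\phi^2 m^{-\frac{1}{2}}$ from the bound. The function names `activation_constants` and `bound_factor` are hypothetical, and the factor ignores the constant hidden in the $O(\cdot)$ and the logarithmic terms, so only the relative sizes are meaningful.

```python
import math


def activation_constants(a: float, b: float) -> tuple[float, float]:
    """Return (Delta_phi, kappa_phi) for phi(s) = a*s + b*|s|."""
    norm_sq = a * a + b * b
    if norm_sq == 0.0:
        raise ValueError("a and b must not both be zero")
    delta = b * b / norm_sq
    kappa = (abs(a) + abs(b)) / math.sqrt(norm_sq)
    return delta, kappa


def bound_factor(a: float, b: float, m: int, m_l: int, l: int) -> float:
    """Leading factor of the bound, without O(.) constants and log terms.

    Illustrative only: the true constant is not given in the abstract.
    """
    delta, kappa = activation_constants(a, b)
    if delta == 0.0:
        # b = 0 means phi is linear; Delta_phi^{-2} is unbounded.
        return math.inf
    return (delta ** -2 + math.sqrt(m_l) * l) * kappa ** 2 / math.sqrt(m)


# Absolute value: a = 0, b = 1.  ReLU: a = b = 1/2, since s/2 + |s|/2 = max(s, 0).
abs_factor = bound_factor(0.0, 1.0, m=1024, m_l=1, l=3)
relu_factor = bound_factor(0.5, 0.5, m=1024, m_l=1, l=3)
print(abs_factor < relu_factor)  # the absolute value gives the smaller factor
```

With these illustrative values ($m = 1024$, $m_l = 1$, $l = 3$), the ReLU factor is larger both because $\Delta_\phi^{-2}$ rises from 1 to 4 and because $\kappa_\phi^2$ rises from 1 to 2.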

Abstract (original)

We study the concentration of the Neural Tangent Kernel (NTK) $K_\theta : \mathbb{R}^{m_0} \times \mathbb{R}^{m_0} \to \mathbb{R}^{m_l \times m_l}$ of $l$-layer Multilayer Perceptrons (MLPs) $N : \mathbb{R}^{m_0} \times \Theta \to \mathbb{R}^{m_l}$ equipped with activation functions $\phi(s) = a s + b \vert s \vert$ for some $a,b \in \mathbb{R}$ with the parameter $\theta \in \Theta$ being initialized at the Edge Of Chaos (EOC). Without relying on the gradient independence assumption that has only been shown to hold asymptotically in the infinitely wide limit, we prove that an approximate version of gradient independence holds at finite width. Showing that the NTK entries $K_\theta(x_{i_1},x_{i_2})$ for $i_1,i_2 \in [1:n]$ over a dataset $\{x_1,\cdots,x_n\} \subset \mathbb{R}^{m_0}$ concentrate simultaneously via maximal inequalities, we prove that the NTK matrix $K(\theta) = [\frac{1}{n} K_\theta(x_{i_1},x_{i_2}) : i_1,i_2 \in [1:n]] \in \mathbb{R}^{nm_l \times nm_l}$ concentrates around its infinitely wide limit $\overset{\scriptscriptstyle\infty}{K} \in \mathbb{R}^{nm_l \times nm_l}$ without the need for linear overparameterization. Our results imply that in order to accurately approximate the limit, hidden layer widths have to grow quadratically as $m_k = k^2 m$ for some $m \in \mathbb{N}+1$ for sufficient concentration. For such MLPs, we obtain the concentration bound $\mathbb{P}( \Vert K(\theta) - \overset{\scriptscriptstyle\infty}{K} \Vert \leq O((\Delta_\phi^{-2} + m_l^{\frac{1}{2}} l) \kappa_\phi^2 m^{-\frac{1}{2}})) \geq 1-O(m^{-1})$ modulo logarithmic terms, where we denoted $\Delta_\phi = \frac{b^2}{a^2+b^2}$ and $\kappa_\phi = \frac{\vert a \vert + \vert b \vert}{\sqrt{a^2 + b^2}}$. This reveals in particular that the absolute value ($\Delta_\phi=1$, $\kappa_\phi=1$) beats the ReLU ($\Delta_\phi=\frac{1}{2}$, $\kappa_\phi=\sqrt{2}$) in terms of the concentration of the NTK.
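
The quadratic width schedule $m_k = k^2 m$ can be illustrated with a short sketch. It assumes that the hidden layers are indexed $k = 1, \dots, l-1$ (with layer $0$ the input and layer $l$ the output) and that $m \geq 1$; neither convention is spelled out in the abstract, and the helper names are hypothetical.

```python
def hidden_widths(m: int, l: int) -> list[int]:
    """Hidden layer widths m_k = k^2 * m for k = 1, ..., l-1 (assumed indexing)."""
    if m < 1 or l < 1:
        raise ValueError("need m >= 1 and l >= 1")
    return [k * k * m for k in range(1, l)]


def layer_widths(m_0: int, m: int, l: int, m_l: int) -> list[int]:
    """All layer widths [m_0, m_1, ..., m_{l-1}, m_l]."""
    return [m_0] + hidden_widths(m, l) + [m_l]


def weight_count(widths: list[int]) -> int:
    """Number of weight-matrix entries (biases not counted)."""
    return sum(w_in * w_out for w_in, w_out in zip(widths, widths[1:]))


widths = layer_widths(m_0=10, m=64, l=4, m_l=1)
print(widths)                # [10, 64, 256, 576, 1]
print(weight_count(widths))  # 165056
```

Under this schedule the deepest hidden layer has width $(l-1)^2 m$, so the parameter count is dominated by the last hidden weight matrices as the depth $l$ grows.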

arXiv information

Authors	Dávid Terjék, Diego González-Sánchez
Published	2025-01-24 18:58:50+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
