GD doesn’t make the cut: Three ways that non-differentiability affects neural network training

Abstract

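This paper critically examines the fundamental differences between gradient methods applied to non-differentiable functions (NGDMs) and classical gradient descent (GD) for differentiable functions, exposing significant gaps in current deep learning optimization theory.
We demonstrate that NGDMs exhibit markedly different convergence properties from GD, which strongly challenges the applicability of the extensive neural network convergence literature based on $L$-smoothness to non-smooth neural networks.
Our analysis reveals paradoxical behavior of NGDM solutions for $L_{1}$-regularized problems: increasing the regularization counterintuitively leads to a larger $L_{1}$ norm of the optimal solution.
This finding calls into question the $L_{1}$ penalty techniques widely adopted for network pruning.
We further question the common assumption that optimization algorithms such as RMSProp behave similarly in differentiable and non-differentiable settings.
Extending the Edge of Stability phenomenon, we demonstrate that it occurs in a broader class of functions, including Lipschitz continuous convex differentiable functions.
This finding raises important questions about its relevance and interpretation in non-convex, non-differentiable neural networks, particularly those using ReLU activations.
Our work identifies critical misunderstandings of NGDMs in influential literature, stemming from an over-reliance on strong smoothness assumptions.
These findings call for a reevaluation of optimization dynamics in deep learning and underscore the need for more nuanced theoretical foundations when analyzing these complex systems.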

Abstract (original)

This paper critically examines the fundamental distinctions between gradient methods applied to non-differentiable functions (NGDMs) and classical gradient descents (GDs) for differentiable functions, revealing significant gaps in current deep learning optimization theory. We demonstrate that NGDMs exhibit markedly different convergence properties compared to GDs, strongly challenging the applicability of extensive neural network convergence literature based on $L$-smoothness to non-smooth neural networks. Our analysis reveals paradoxical behavior of NGDM solutions for $L_{1}$-regularized problems, where increasing regularization counterintuitively leads to larger $L_{1}$ norms of optimal solutions. This finding calls into question widely adopted $L_{1}$ penalization techniques for network pruning. We further challenge the common assumption that optimization algorithms like RMSProp behave similarly in differentiable and non-differentiable contexts. Expanding on the Edge of Stability phenomenon, we demonstrate its occurrence in a broader class of functions, including Lipschitz continuous convex differentiable functions. This finding raises important questions about its relevance and interpretation in non-convex, non-differentiable neural networks, particularly those using ReLU activations. Our work identifies critical misunderstandings of NGDMs in influential literature, stemming from an overreliance on strong smoothness assumptions. These findings necessitate a reevaluation of optimization dynamics in deep learning, emphasizing the crucial need for more nuanced theoretical foundations in analyzing these complex systems.
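
As a rough illustration of why constant-step gradient methods can behave differently on non-differentiable functions, the following minimal Python sketch compares three scalar toy problems. It is not taken from the paper and does not reproduce its experiments or its specific claim about regularization strength. The helper names (`sign`, `descend`), the objectives, the starting point, and the step size are all assumptions chosen for illustration.

```python
def sign(x):
    """One valid subgradient of |x|; returns 0.0 at the kink x == 0."""
    return float((x > 0) - (x < 0))


def descend(grad, x0, step, iters):
    """Run x <- x - step * grad(x) with a constant step; return all iterates."""
    xs = [x0]
    for _ in range(iters):
        xs.append(xs[-1] - step * grad(xs[-1]))
    return xs


# Smooth case: f(x) = x^2 / 2 with gradient x. Iterates contract toward 0.
smooth = descend(lambda x: x, x0=1.125, step=0.25, iters=50)

# Non-smooth case: f(x) = |x| with subgradient sign(x), same step size.
# The iterates end up alternating between -0.125 and 0.125.
nonsmooth = descend(sign, x0=1.125, step=0.25, iters=50)

# Toy L1-regularized problem: f(x) = (x - a)^2 / 2 + lam * |x|.
# Because |a| <= lam, the exact minimizer is x = 0, yet the constant-step
# subgradient iterates keep cycling through nonzero values.
a, lam = 0.5, 1.0
l1 = descend(lambda x: (x - a) + lam * sign(x), x0=1.125, step=0.25, iters=50)

print("smooth, final |x|:", abs(smooth[-1]))
print("non-smooth, last iterates:", nonsmooth[-4:])
print("L1 toy problem, last iterates:", l1[-4:])
```

In this toy setting the smooth objective converges, while both non-smooth objectives settle into a cycle around the minimizer rather than at it, so the weight in the $L_{1}$ example never becomes exactly zero. A decaying step size or a proximal update would behave differently; the sketch deliberately uses neither.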

arXiv information

Author	Siddharth Krishna Kumar
Published	2024-11-07 18:22:41+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
