Occam Gradient Descent

Abstract

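Deep learning neural network models must be large enough to adapt to their problem domain, yet small enough to avoid overfitting the training data during gradient descent.
To balance these competing demands, overprovisioned deep learning models such as transformers are trained for a single epoch on large datasets, which makes them inefficient in their use of both computing resources and training data.
In response to these inefficiencies, we draw on learning theory to derive Occam Gradient Descent, an algorithm that interleaves adaptive reduction of model size, which minimizes generalization error, with gradient descent on the model weights, which minimizes fitting error.
In contrast, traditional gradient descent greedily minimizes fitting error without regard to generalization error.
Our algorithm simultaneously descends both the weight space and the topological size of any neural network without modification.
With respect to loss, compute, and model size, our experiments show that (a) on image classification benchmarks, linear and convolutional neural networks trained with Occam Gradient Descent outperform traditional gradient descent, with or without post-training pruning;
(b) on a range of tabular data classification tasks, neural networks trained with Occam Gradient Descent outperform traditional gradient descent as well as Random Forests;
(c) on natural language transformers, Occam Gradient Descent outperforms traditional gradient descent.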
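
The interleaving idea described above can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not state how the size reduction is chosen, so the fixed pruning schedule, the magnitude-based pruning criterion, the linear model, and every name here (`train_interleaved`, `prune_every`, `prune_frac`) are assumptions made purely for illustration. It only shows the general pattern of alternating gradient steps on the weights with steps that shrink the model.

```python
import random


def train_interleaved(xs, ys, epochs=200, lr=0.5, prune_every=50, prune_frac=0.25):
    """Gradient descent on a linear model, interleaved with magnitude pruning.

    Hypothetical illustration only: the pruning rule and schedule are
    assumptions, not the adaptive rule derived in the paper.
    """
    n = len(xs[0])
    weights = [0.0] * n
    active = [True] * n  # False marks a weight that has been pruned away

    for epoch in range(1, epochs + 1):
        # Weight step: full-batch gradient of the mean squared fitting error.
        grads = [0.0] * n
        for x, y in zip(xs, ys):
            err = sum(w * xi for w, xi in zip(weights, x)) - y
            for j in range(n):
                grads[j] += 2.0 * err * x[j] / len(xs)
        weights = [w - lr * g if a else 0.0
                   for w, g, a in zip(weights, grads, active)]

        # Size step (assumed rule): every prune_every epochs, drop the
        # smallest-magnitude fraction of the weights that are still active.
        if epoch % prune_every == 0:
            alive = [j for j in range(n) if active[j]]
            n_prune = int(len(alive) * prune_frac)
            for j in sorted(alive, key=lambda j: abs(weights[j]))[:n_prune]:
                active[j] = False
                weights[j] = 0.0
    return weights, active


# Toy data: 8 input features, but only the first two influence the target.
rng = random.Random(0)
xs = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(100)]
ys = [3.0 * x[0] - 2.0 * x[1] for x in xs]

weights, active = train_interleaved(xs, ys)
print(sum(active), [round(w, 2) for w in weights])
```

On this toy problem the model shrinks while it trains, and the two informative weights are the ones expected to survive. A real implementation would replace the fixed schedule with the paper's adaptive criterion and operate on a neural network rather than a linear model.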

Abstract (original)

Deep learning neural network models must be large enough to adapt to their problem domain, while small enough to avoid overfitting training data during gradient descent. To balance these competing demands, overprovisioned deep learning models such as transformers are trained for a single epoch on large data sets, and hence inefficient with both computing resources and training data. In response to these inefficiencies, we exploit learning theory to derive Occam Gradient Descent, an algorithm that interleaves adaptive reduction of model size to minimize generalization error, with gradient descent on model weights to minimize fitting error. In contrast, traditional gradient descent greedily minimizes fitting error without regard to generalization error. Our algorithm simultaneously descends the space of weights and topological size of any neural network without modification. With respect to loss, compute and model size, our experiments show (a) on image classification benchmarks, linear and convolutional neural networks trained with Occam Gradient Descent outperform traditional gradient descent with or without post-train pruning; (b) on a range of tabular data classification tasks, neural networks trained with Occam Gradient Descent outperform traditional gradient descent, as well as Random Forests; (c) on natural language transformers, Occam Gradient Descent outperforms traditional gradient descent.

arXiv information

Author	B. N. Kausik
Published	2024-07-31 17:57:33+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
