Approximation Bounds for Transformer Networks with Application to Regression

Summary

We investigate the approximation capabilities of Transformer networks for Hölder and Sobolev functions, and apply these results to nonparametric regression estimation with dependent observations. First, we establish novel upper bounds for standard Transformer networks approximating sequence-to-sequence mappings whose component functions are Hölder continuous with smoothness index $\gamma \in (0,1]$: to achieve an approximation error $\varepsilon$ under the $L^p$-norm for $p \in [1, \infty]$, it suffices to use a fixed-depth Transformer network whose total number of parameters scales as $\varepsilon^{-d_x n / \gamma}$. Second, we derive explicit convergence rates for the nonparametric regression problem under various $\beta$-mixing data assumptions, which allow the dependence between observations to weaken over time; our sample complexity bounds impose no constraints on weight magnitudes. Finally, we show that if the self-attention layer of a Transformer can perform column averaging, the network can approximate sequence-to-sequence Hölder functions, offering new insights into the interpretability of the self-attention mechanism.
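
As a concrete reading of the parameter bound above (an illustrative instantiation with values chosen here for exposition, not taken from the paper):

```latex
% Illustrative instantiation (values chosen for exposition, not from the paper):
% token dimension d_x = 2, sequence length n = 3, Hölder index gamma = 1.
\[
  \#\mathrm{params} \;\lesssim\; \varepsilon^{-d_x n / \gamma}
  \;=\; \varepsilon^{-2 \cdot 3 / 1}
  \;=\; \varepsilon^{-6},
\]
% i.e. halving the target error multiplies the parameter budget by roughly 2^6 = 64.
```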

Summary (original)

We explore the approximation capabilities of Transformer networks for Hölder and Sobolev functions, and apply these results to address nonparametric regression estimation with dependent observations. First, we establish novel upper bounds for standard Transformer networks approximating sequence-to-sequence mappings whose component functions are Hölder continuous with smoothness index $\gamma \in (0,1]$. To achieve an approximation error $\varepsilon$ under the $L^p$-norm for $p \in [1, \infty]$, it suffices to use a fixed-depth Transformer network whose total number of parameters scales as $\varepsilon^{-d_x n / \gamma}$. This result not only extends existing findings to include the case $p = \infty$, but also matches the best known upper bounds on number of parameters previously obtained for fixed-depth FNNs and RNNs. Similar bounds are also derived for Sobolev functions. Second, we derive explicit convergence rates for the nonparametric regression problem under various $\beta$-mixing data assumptions, which allow the dependence between observations to weaken over time. Our bounds on the sample complexity impose no constraints on weight magnitudes. Lastly, we propose a novel proof strategy to establish approximation bounds, inspired by the Kolmogorov-Arnold representation theorem. We show that if the self-attention layer in a Transformer can perform column averaging, the network can approximate sequence-to-sequence Hölder functions, offering new insights into the interpretability of self-attention mechanisms.
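
To make the last claim concrete, here is a minimal NumPy sketch (an illustration, not the paper's construction) of a single scaled dot-product self-attention head: with the query and key projections set to zero, the softmax weights become uniform, so the head outputs the token-wise (column) average of its input, which is the "column averaging" operation referred to in the abstract.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.
    X has shape (n, d_x): rows are the n tokens of the sequence (the abstract's columns)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])
    A = softmax(scores, axis=-1)   # (n, n) attention weights
    return A @ V

d_x, n = 4, 6
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_x))

# Zero query/key projections -> all attention scores equal -> uniform softmax
# weights 1/n, so every output token equals the average over all input tokens.
W_q = np.zeros((d_x, d_x))
W_k = np.zeros((d_x, d_x))
W_v = np.eye(d_x)

out = self_attention(X, W_q, W_k, W_v)
print(np.allclose(out, np.tile(X.mean(axis=0), (n, 1))))  # True
```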

arXiv information

Authors: Yuling Jiao, Yanming Lai, Defeng Sun, Yang Wang, Bokai Yan
Published: 2025-04-16 15:25:58+00:00
arXiv site: arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.LG, stat.ML