jarxiv | Japanese arxiv | Page 178

On the Convergence of Gradient Descent on Learning Transformers with Residual Connections

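Posted: June 6, 2025 | Author: jarxiv

Summary

Transformer models have emerged as fundamental tools across a range of scientific and engineering fields, owing to their strong performance in diverse applications.
Despite this empirical success, the theoretical foundations of Transformers remain relatively underdeveloped, particularly with regard to their training dynamics.
Existing work mainly examines isolated components, such as self-attention mechanisms and feedforward networks, without thoroughly investigating the interdependencies between them, especially when residual connections are present.
In this paper, we aim to close that gap by analyzing the convergence behavior of a structurally complete yet single-layer Transformer comprising self-attention, a feedforward network, and residual connections.
We show that, under appropriate initialization, gradient descent converges at a linear rate, with the speed determined by the minimum and maximum singular values of the attention layer's output matrix.
Our analysis further reveals that residual connections alleviate the ill-conditioning of this output matrix, a problem that stems from the low-rank structure imposed by the softmax operation, and thereby improve optimization stability.
We also extend the theoretical results to multi-layer Transformer architectures, confirming the linear convergence rate of gradient descent under suitable initialization.
Empirical results support these theoretical insights and illustrate the beneficial role of residual connections in promoting convergence stability.

Abstract (original)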

Transformer models have emerged as fundamental tools across various scientific and engineering disciplines, owing to their outstanding performance in diverse applications. Despite this empirical success, the theoretical foundations of Transformers remain relatively underdeveloped, particularly in understanding their training dynamics. Existing research predominantly examines isolated components–such as self-attention mechanisms and feedforward networks–without thoroughly investigating the interdependencies between these components, especially when residual connections are present. In this paper, we aim to bridge this gap by analyzing the convergence behavior of a structurally complete yet single-layer Transformer, comprising self-attention, a feedforward network, and residual connections. We demonstrate that, under appropriate initialization, gradient descent exhibits a linear convergence rate, where the convergence speed is determined by the minimum and maximum singular values of the output matrix from the attention layer. Moreover, our analysis reveals that residual connections serve to ameliorate the ill-conditioning of this output matrix, an issue stemming from the low-rank structure imposed by the softmax operation, thereby promoting enhanced optimization stability. We also extend our theoretical findings to a multi-layer Transformer architecture, confirming the linear convergence rate of gradient descent under suitable initialization. Empirical results corroborate our theoretical insights, illustrating the beneficial role of residual connections in promoting convergence stability.
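
The conditioning claim can be checked numerically on a toy example. The sketch below is not the paper's experiment: the sizes, the small query/key initialization, and all function names are assumptions chosen for illustration. It builds one softmax attention output with and without a skip path and compares the ratio of the largest to the smallest singular value.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def condition_number(m):
    """Ratio of the largest to the smallest singular value."""
    s = np.linalg.svd(m, compute_uv=False)
    return s[0] / s[-1]

def attention_condition_numbers(n_tokens=32, dim=8, seed=0):
    # Hypothetical sizes and initialization scales, chosen for illustration.
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_tokens, dim))       # token embeddings
    W_q = 0.02 * rng.standard_normal((dim, dim))   # small init: near-uniform attention
    W_k = 0.02 * rng.standard_normal((dim, dim))
    W_v = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    A = softmax((X @ W_q) @ (X @ W_k).T / np.sqrt(dim))
    attn_out = A @ X @ W_v   # attention output without the skip path
    res_out = X + attn_out   # same output with a residual connection
    return condition_number(attn_out), condition_number(res_out)

cond_attn, cond_res = attention_condition_numbers()
print(f"attention only: {cond_attn:.1f}, with residual: {cond_res:.1f}")
```

With near-uniform attention weights the matrix `A` is close to rank one, so `attn_out` has very small trailing singular values; adding `X` back restores them, which is the effect the abstract attributes to residual connections.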

arXiv information

Authors	Zhen Qin, Jinxin Zhou, Zhihui Zhu
Published	2025-06-05 17:10:22+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.LG, math.OC

Conservative classifiers do consistently well with improving agents: characterizing statistical and online learning

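Posted: June 6, 2025 | Author: jarxiv

Summary

Machine learning is now ubiquitous in societal decision-making, for example in evaluating job candidates or loan applications, and it is increasingly important to account for how the classified agents react to the learning algorithm.
Most of the recent literature on strategic classification has focused on reducing and countering deceptive behavior by classified agents, but recent work by Attias et al. identifies surprising properties of learnability when agents genuinely improve in order to attain the desired classification, such as a smaller generalization error than in standard PAC learning.
In this paper, we characterize this so-called learnability with improvements along several new axes.
We introduce an asymmetric variant of minimally consistent concept classes and use it to give an exact characterization of proper learning with improvements in the realizable setting.
Whereas prior work studies learnability only under general, arbitrary agent improvement regions, we obtain positive results for the more natural Euclidean ball improvement sets.
In particular, we characterize improper learning under a mild generative assumption on the data distribution.
We further show how to learn in more challenging settings, achieving lower generalization error under well-studied bounded noise models and obtaining mistake bounds in realizable and agnostic online learning.
We resolve open questions posed by Attias et al. for both proper and improper learning.

Abstract (original)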

Machine learning is now ubiquitous in societal decision-making, for example in evaluating job candidates or loan applications, and it is increasingly important to take into account how classified agents will react to the learning algorithms. The majority of recent literature on strategic classification has focused on reducing and countering deceptive behaviors by the classified agents, but recent work of Attias et al. identifies surprising properties of learnability when the agents genuinely improve in order to attain the desirable classification, such as smaller generalization error than standard PAC-learning. In this paper we characterize so-called learnability with improvements across multiple new axes. We introduce an asymmetric variant of minimally consistent concept classes and use it to provide an exact characterization of proper learning with improvements in the realizable setting. While prior work studies learnability only under general, arbitrary agent improvement regions, we give positive results for more natural Euclidean ball improvement sets. In particular, we characterize improper learning under a mild generative assumption on the data distribution. We further show how to learn in more challenging settings, achieving lower generalization error under well-studied bounded noise models and obtaining mistake bounds in realizable and agnostic online learning. We resolve open questions posed by Attias et al. for both proper and improper learning.
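
To make the title's intuition concrete, here is a one-dimensional sketch under simplifying assumptions that are not taken from the paper: features are scalars, the true concept is a threshold, the improvement set is an interval of radius `radius` (a 1-D Euclidean ball), and an agent classified negative moves to the nearest positively classified point if one is within reach. Because the improvement is genuine, the true label is evaluated at the new position. All names are hypothetical.

```python
import numpy as np

def respond(x, threshold, radius):
    """Agents classified negative move to the nearest accepted point, if reachable."""
    can_improve = (x < threshold) & (x >= threshold - radius)
    return np.where(can_improve, threshold, x)

def error_with_improvements(x, threshold, true_threshold, radius):
    """Fraction of agents misclassified after they respond to the classifier."""
    x_new = respond(x, threshold, radius)
    predicted = x_new >= threshold
    actual = x_new >= true_threshold   # genuine improvement: the label follows the new features
    return float(np.mean(predicted != actual))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 10_000)      # hypothetical agent population
true_threshold, radius = 0.5, 0.1

err_lenient = error_with_improvements(x, 0.45, true_threshold, radius)
err_conservative = error_with_improvements(x, 0.55, true_threshold, radius)
err_too_strict = error_with_improvements(x, 0.70, true_threshold, radius)
print(err_lenient, err_conservative, err_too_strict)
```

In this toy model a threshold placed slightly above the true one (by at most `radius`) makes no mistakes, because every agent who is pushed to improve ends up truly qualified. A lenient threshold accepts agents who are still unqualified, and an overly strict one rejects qualified agents who cannot reach it.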

arXiv information

Authors	Dravyansh Sharma, Alec Sun
Published	2025-06-05 17:13:59+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.GT, cs.LG, cs.MA

Continual Learning from Simulated Interactions via Multitask Prospective Rehearsal for Bionic Limb Behavior Modeling

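Posted: June 6, 2025 | Author: jarxiv

Summary

Lower-limb amputations and neuromuscular impairments severely restrict mobility and call for advances beyond conventional prosthetics.
Motorized bionic limbs show promise, but their effectiveness depends on replicating the dynamic coordination of human movement across diverse environments.
In this paper, we introduce a model of human behavior in the context of bionic prosthesis control.
Our approach leverages human locomotion demonstrations to learn the synergistic coupling of the lower limbs, enabling prediction of the kinematic behavior of a missing limb during tasks such as walking, climbing inclines, and climbing stairs.
We propose a multitask, continually adaptive model that anticipates and refines movements over time.
At the core of our method is a technique called multitask prospective rehearsal, which anticipates and synthesizes future movements based on the previous prediction and applies a corrective mechanism to subsequent predictions.
Our evolving architecture merges lightweight, task-specific modules on a shared backbone, ensuring both specificity and scalability.
We validate the model through experiments on real-world human gait datasets, including data from transtibial amputees, across a wide range of locomotion tasks.
The results show that our approach consistently outperforms baseline models, particularly in scenarios with distribution shifts, adversarial perturbations, and noise.

Abstract (original)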

Lower limb amputations and neuromuscular impairments severely restrict mobility, necessitating advancements beyond conventional prosthetics. While motorized bionic limbs show promise, their effectiveness depends on replicating the dynamic coordination of human movement across diverse environments. In this paper, we introduce a model for human behavior in the context of bionic prosthesis control. Our approach leverages human locomotion demonstrations to learn the synergistic coupling of the lower limbs, enabling the prediction of the kinematic behavior of a missing limb during tasks such as walking, climbing inclines, and stairs. We propose a multitasking, continually adaptive model that anticipates and refines movements over time. At the core of our method is a technique called multitask prospective rehearsal, that anticipates and synthesizes future movements based on the previous prediction and employs a corrective mechanism for subsequent predictions. Our evolving architecture merges lightweight, task-specific modules on a shared backbone, ensuring both specificity and scalability. We validate our model through experiments on real-world human gait datasets, including transtibial amputees, across a wide range of locomotion tasks. Results demonstrate that our approach consistently outperforms baseline models, particularly in scenarios with distributional shifts, adversarial perturbations, and noise.

arXiv information

Authors	Sharmita Dey, Benjamin Paassen, Sarath Ravindran Nair, Sabri Boughorbel, Arndt F. Schilling
Published	2025-06-05 17:17:51+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.LG, cs.RO

Learning long range dependencies through time reversal symmetry breaking

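Posted: June 6, 2025 | Author: jarxiv

Summary

Deep state space models (SSMs) reignite physics-grounded compute paradigms, since RNNs could be natively embodied in dynamical systems.
This calls for dedicated learning algorithms that obey core physical principles, along with efficient techniques to simulate these systems and guide their design.
We propose Recurrent Hamiltonian Echo Learning (RHEL), an algorithm that provably computes loss gradients as finite differences of physical trajectories of non-dissipative Hamiltonian systems.
In ML terms, RHEL requires only three "forward passes" regardless of model size, with no explicit Jacobian computation and no variance in the gradient estimate.
Motivated by the physical realization of the algorithm, we first introduce RHEL in continuous time and demonstrate its formal equivalence with the continuous adjoint state method.
To facilitate the simulation of Hamiltonian systems trained by RHEL, we propose a discrete-time version of RHEL that is equivalent to Backpropagation Through Time (BPTT) when applied to a class of recurrent modules we call Hamiltonian Recurrent Units (HRUs).
This setting lets us demonstrate the scalability of RHEL by generalizing these results to hierarchies of HRUs, which we call Hamiltonian SSMs (HSSMs).
We apply RHEL to train HSSMs with linear and nonlinear dynamics on a variety of time-series tasks, ranging from mid-range to long-range classification and regression, with sequence lengths reaching $\sim 50k$.
We show that RHEL consistently matches the performance of BPTT across all models and tasks.
This work opens new doors for the design of scalable, energy-efficient physical systems with self-learning capabilities for sequence modeling.

Abstract (original)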

Deep State Space Models (SSMs) reignite physics-grounded compute paradigms, as RNNs could natively be embodied into dynamical systems. This calls for dedicated learning algorithms obeying to core physical principles, with efficient techniques to simulate these systems and guide their design. We propose Recurrent Hamiltonian Echo Learning (RHEL), an algorithm which provably computes loss gradients as finite differences of physical trajectories of non-dissipative, Hamiltonian systems. In ML terms, RHEL only requires three ‘forward passes’ irrespective of model size, without explicit Jacobian computation, nor incurring any variance in the gradient estimation. Motivated by the physical realization of our algorithm, we first introduce RHEL in continuous time and demonstrate its formal equivalence with the continuous adjoint state method. To facilitate the simulation of Hamiltonian systems trained by RHEL, we propose a discrete-time version of RHEL which is equivalent to Backpropagation Through Time (BPTT) when applied to a class of recurrent modules which we call Hamiltonian Recurrent Units (HRUs). This setting allows us to demonstrate the scalability of RHEL by generalizing these results to hierarchies of HRUs, which we call Hamiltonian SSMs (HSSMs). We apply RHEL to train HSSMs with linear and nonlinear dynamics on a variety of time-series tasks ranging from mid-range to long-range classification and regression with sequence length reaching $\sim 50k$. We show that RHEL consistently matches the performance of BPTT across all models and tasks. This work opens new doors for the design of scalable, energy-efficient physical systems endowed with self-learning capabilities for sequence modelling.
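
The "echo" in the title relies on the time-reversal symmetry of non-dissipative Hamiltonian dynamics. The sketch below illustrates only that property, not RHEL itself: it integrates a hypothetical nonlinear oscillator forward with a leapfrog scheme, flips the momentum, integrates again for the same number of steps, and recovers the initial state. The potential, step size, and names are assumptions for illustration.

```python
import numpy as np

def grad_potential(q):
    # Hypothetical potential V(q) = q^2/2 + q^4/4, so dV/dq = q + q^3.
    return q + q**3

def leapfrog(q, p, dt, steps):
    """Time-reversible integrator for H(q, p) = p^2/2 + V(q)."""
    for _ in range(steps):
        p = p - 0.5 * dt * grad_potential(q)   # half kick
        q = q + dt * p                         # drift
        p = p - 0.5 * dt * grad_potential(q)   # half kick
    return q, p

q0 = np.array([1.0, -0.5])
p0 = np.array([0.3, 0.8])

# Forward pass, then flip the momentum and run the same dynamics again.
q1, p1 = leapfrog(q0, p0, dt=0.01, steps=1000)
q_back, p_back = leapfrog(q1, -p1, dt=0.01, steps=1000)

# The echo retraces the trajectory: (q_back, -p_back) matches (q0, p0)
# up to floating-point round-off.
print(np.abs(q_back - q0).max(), np.abs(p_back + p0).max())
```

A dissipative system (for example, one with a friction term in the momentum update) would not retrace its path this way, which is why the abstract restricts attention to non-dissipative Hamiltonian systems.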

arXiv information

Authors	Guillaume Pourcel, Maxence Ernoult
Published	2025-06-05 17:20:39+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.LG

Tight analyses of first-order methods with error feedback

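Posted: June 6, 2025 | Author: jarxiv

Summary

Communication between agents is often a major computational bottleneck in distributed learning.
One of the most common mitigation strategies is to compress the exchanged information, thereby reducing communication overhead.
To counteract the degradation in convergence associated with compressed communication, error feedback schemes were introduced, most notably $\mathrm{EF}$ and $\mathrm{EF}^{21}$.
In this work, we provide a tight analysis of both methods.
Specifically, we find the Lyapunov function that yields the best possible convergence rate for each method, with matching lower bounds.
This principled approach yields sharp performance guarantees and enables a rigorous, apples-to-apples comparison between $\mathrm{EF}$, $\mathrm{EF}^{21}$, and compressed gradient descent.
Our analysis is carried out in a simplified yet representative setting, which allows for clean theoretical insights and a fair comparison of the underlying mechanisms.

Abstract (original)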

Communication between agents often constitutes a major computational bottleneck in distributed learning. One of the most common mitigation strategies is to compress the information exchanged, thereby reducing communication overhead. To counteract the degradation in convergence associated with compressed communication, error feedback schemes — most notably $\mathrm{EF}$ and $\mathrm{EF}^{21}$ — were introduced. In this work, we provide a tight analysis of both of these methods. Specifically, we find the Lyapunov function that yields the best possible convergence rate for each method — with matching lower bounds. This principled approach yields sharp performance guarantees and enables a rigorous, apples-to-apples comparison between $\mathrm{EF}$, $\mathrm{EF}^{21}$, and compressed gradient descent. Our analysis is carried out in a simplified yet representative setting, which allows for clean theoretical insights and fair comparison of the underlying mechanisms.
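
For readers unfamiliar with the three methods being compared, the following single-worker sketch shows their update rules in their commonly stated forms, using a top-k compressor on a small quadratic. The objective, compressor, step size, and function names are assumptions for illustration; the paper's exact setting and constants are not reproduced here.

```python
import numpy as np

def top_k(v, k):
    """Contractive compressor: keep the k largest-magnitude entries of v."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def run(method, A, x0, step, iters, k):
    """Minimize f(x) = 0.5 * x^T A x with compressed gradient information."""
    grad = lambda z: A @ z
    x = x0.copy()
    e = np.zeros_like(x)       # EF: accumulated compression error
    g = top_k(grad(x), k)      # EF21: running estimate of the gradient
    for _ in range(iters):
        if method == "cgd":    # compressed gradient descent
            x = x - step * top_k(grad(x), k)
        elif method == "ef":   # classic error feedback
            v = e + step * grad(x)
            c = top_k(v, k)
            x = x - c
            e = v - c          # carry forward what the compressor dropped
        elif method == "ef21": # compress the change in the gradient estimate
            x = x - step * g
            g = g + top_k(grad(x) - g, k)
        else:
            raise ValueError(f"unknown method: {method}")
    return x

A = np.diag([1.0, 2.0, 3.0, 4.0])   # hypothetical strongly convex quadratic
x0 = np.ones(4)
final_norms = {
    m: float(np.linalg.norm(run(m, A, x0, step=0.01, iters=5000, k=2)))
    for m in ("cgd", "ef", "ef21")
}
print(final_norms)
```

Classic EF stores the part of the update that the compressor dropped and adds it back at the next step, whereas EF21 keeps a gradient estimate and compresses only its correction. On this easy problem all three reach the minimizer at the origin; the paper's contribution is a tight comparison of their rates, which this sketch does not attempt.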

arXiv information

Authors	Daniel Berg Thomsen, Adrien Taylor, Aymeric Dieuleveut
Published	2025-06-05 17:30:18+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.DC, cs.LG, math.OC

How to Unlock Time Series Editing? Diffusion-Driven Approach with Multi-Grained Control

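Posted: June 6, 2025 | Author: jarxiv

Summary

Recent advances in time series generation have shown promise, yet controlling the properties of generated sequences remains challenging.
Time Series Editing (TSE), which makes precise modifications while preserving temporal coherence, must account for both point-level constraints and segment-level controls that current methods struggle to provide.
We introduce the CocktailEdit framework to enable simultaneous, flexible control across different types of constraints.
The framework combines two key mechanisms: a confidence-weighted anchor control for point-wise constraints, and a classifier-based control for managing statistical properties such as sums and averages over segments.
Our method achieves precise local control during the denoising inference stage while maintaining temporal coherence, and it integrates seamlessly with any conditionally trained diffusion-based time series model.
Extensive experiments across diverse datasets and models demonstrate its effectiveness.
Our work bridges the gap between pure generative modeling and real-world time series editing needs, offering a flexible solution for human-in-the-loop time series generation and editing.
The code and a demo are provided for validation.

Abstract (original)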

Recent advances in time series generation have shown promise, yet controlling properties in generated sequences remains challenging. Time Series Editing (TSE) – making precise modifications while preserving temporal coherence – consider both point-level constraints and segment-level controls that current methods struggle to provide. We introduce the CocktailEdit framework to enable simultaneous, flexible control across different types of constraints. This framework combines two key mechanisms: a confidence-weighted anchor control for point-wise constraints and a classifier-based control for managing statistical properties such as sums and averages over segments. Our methods achieve precise local control during the denoising inference stage while maintaining temporal coherence and integrating seamlessly, with any conditionally trained diffusion-based time series models. Extensive experiments across diverse datasets and models demonstrate its effectiveness. Our work bridges the gap between pure generative modeling and real-world time series editing needs, offering a flexible solution for human-in-the-loop time series generation and editing. The code and demo are provided for validation.

arXiv information

Authors	Hao Yu, Chu Xin Cheng, Runlong Yu, Yuyang Ye, Shiwei Tong, Zhaofeng Liu, Defu Lian
Published	2025-06-05 17:32:00+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.LG

Learning Beyond Experience: Generalizing to Unseen State Space with Reservoir Computing

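Posted: June 6, 2025 | Author: jarxiv

Summary

Machine learning techniques offer an effective approach to modeling dynamical systems solely from observed data.
However, without explicit structural priors (built-in assumptions about the underlying dynamics), these techniques typically struggle to generalize to aspects of the dynamics that are poorly represented in the training data.
Here, we demonstrate that reservoir computing, a simple, efficient, and versatile machine learning framework often used for data-driven modeling of dynamical systems, can generalize to unexplored regions of state space without explicit structural priors.
First, we describe a multiple-trajectory training scheme for reservoir computers that supports training across a collection of disjoint time series, enabling effective use of the available training data.
Then, applying this training scheme to multistable dynamical systems, we show that RCs trained on trajectories from a single basin of attraction can achieve out-of-domain generalization by capturing system behavior in entirely unobserved basins.

Abstract (original)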

Machine learning techniques offer an effective approach to modeling dynamical systems solely from observed data. However, without explicit structural priors — built-in assumptions about the underlying dynamics — these techniques typically struggle to generalize to aspects of the dynamics that are poorly represented in the training data. Here, we demonstrate that reservoir computing — a simple, efficient, and versatile machine learning framework often used for data-driven modeling of dynamical systems — can generalize to unexplored regions of state space without explicit structural priors. First, we describe a multiple-trajectory training scheme for reservoir computers that supports training across a collection of disjoint time series, enabling effective use of available training data. Then, applying this training scheme to multistable dynamical systems, we show that RCs trained on trajectories from a single basin of attraction can achieve out-of-domain generalization by capturing system behavior in entirely unobserved basins.
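
One plausible reading of a multiple-trajectory training scheme is sketched below with a standard echo state network: the reservoir is reset and driven separately by each time series, an initial transient is discarded, and the collected states from all series are stacked into a single ridge regression for the readout. This is a minimal sketch under those assumptions, not the authors' implementation; the reservoir size, washout length, ridge parameter, toy data, and all names are hypothetical.

```python
import numpy as np

def make_reservoir(n_res, n_in, spectral_radius=0.9, seed=0):
    """Random recurrent and input weights, rescaled to a target spectral radius."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_res, n_res))
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
    return W, W_in

def collect_states(W, W_in, series, washout):
    """Drive a freshly reset reservoir with one series; pair states with next inputs."""
    r = np.zeros(W.shape[0])
    states, targets = [], []
    for t in range(len(series) - 1):
        r = np.tanh(W @ r + W_in @ series[t])
        if t >= washout:               # discard the initial transient
            states.append(r)
            targets.append(series[t + 1])
    return np.array(states), np.array(targets)

def train_readout(W, W_in, trajectories, washout=50, ridge=1e-4):
    """Stack the states from all disjoint series and solve one ridge regression."""
    pairs = [collect_states(W, W_in, s, washout) for s in trajectories]
    R = np.vstack([p[0] for p in pairs])
    Y = np.vstack([p[1] for p in pairs])
    return np.linalg.solve(R.T @ R + ridge * np.eye(R.shape[1]), R.T @ Y).T

def sine_series(phase, steps=600, dt=0.1):
    """Toy 2-D trajectory (sin, cos), standing in for real measurements."""
    t = phase + dt * np.arange(steps)
    return np.stack([np.sin(t), np.cos(t)], axis=1)

W, W_in = make_reservoir(n_res=100, n_in=2)
W_out = train_readout(W, W_in, [sine_series(0.0), sine_series(2.0)])

# One-step-ahead prediction error on a third, held-out trajectory.
states, targets = collect_states(W, W_in, sine_series(4.0), washout=50)
mse = float(np.mean((states @ W_out.T - targets) ** 2))
print(mse)
```

Resetting the reservoir for each series avoids stitching unrelated trajectories together, which would create artificial jumps in the training data. The toy system here has a single attractor, so the sketch shows only the training scheme and not the cross-basin generalization reported in the abstract.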

arXiv information

Authors	Declan A. Norton, Yuanzhao Zhang, Michelle Girvan
Published	2025-06-05 17:46:07+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.LG, math.DS, nlin.CD, physics.comp-ph

A Smooth Sea Never Made a Skilled $\texttt{SAILOR}$: Robust Imitation via Learning to Search

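Posted: June 6, 2025 | Author: jarxiv

Summary

The fundamental limitation of the behavioral cloning (BC) approach to imitation learning is that it only teaches an agent what the expert did at states the expert visited.
This means that when a BC agent makes a mistake that takes it out of the support of the demonstrations, it often does not know how to recover.
In this sense, BC is akin to giving the agent the fish (dense supervision across a narrow set of states) rather than teaching it to fish, that is, to reason independently about achieving the expert's outcome even when faced with unseen situations at test time.
In response, we explore learning to search (L2S) from expert demonstrations, i.e., learning the components required to plan, at test time, to match expert outcomes even after making a mistake.
These include (1) a world model and (2) a reward model.
We carefully ablate the set of algorithmic and design decisions required to combine these and other components for stable, sample- and interaction-efficient learning of recovery behavior without additional human corrections.
Across a dozen visual manipulation tasks from three benchmarks, our approach $\texttt{SAILOR}$ consistently outperforms state-of-the-art Diffusion Policies trained via BC on the same data.
Furthermore, scaling up the number of demonstrations used for BC by 5-10$\times$ still leaves a performance gap.
We find that $\texttt{SAILOR}$ can identify nuanced failures and is robust to reward hacking.
Our code is available at https://github.com/arnavkj1995/SAILOR .

Abstract (original)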

The fundamental limitation of the behavioral cloning (BC) approach to imitation learning is that it only teaches an agent what the expert did at states the expert visited. This means that when a BC agent makes a mistake which takes them out of the support of the demonstrations, they often don’t know how to recover from it. In this sense, BC is akin to giving the agent the fish — giving them dense supervision across a narrow set of states — rather than teaching them to fish: to be able to reason independently about achieving the expert’s outcome even when faced with unseen situations at test-time. In response, we explore learning to search (L2S) from expert demonstrations, i.e. learning the components required to, at test time, plan to match expert outcomes, even after making a mistake. These include (1) a world model and (2) a reward model. We carefully ablate the set of algorithmic and design decisions required to combine these and other components for stable and sample/interaction-efficient learning of recovery behavior without additional human corrections. Across a dozen visual manipulation tasks from three benchmarks, our approach $\texttt{SAILOR}$ consistently out-performs state-of-the-art Diffusion Policies trained via BC on the same data. Furthermore, scaling up the amount of demonstrations used for BC by 5-10$\times$ still leaves a performance gap. We find that $\texttt{SAILOR}$ can identify nuanced failures and is robust to reward hacking. Our code is available at https://github.com/arnavkj1995/SAILOR .

arXiv information

Authors	Arnav Kumar Jain, Vibhakar Mohta, Subin Kim, Atiksh Bhardwaj, Juntao Ren, Yunhai Feng, Sanjiban Choudhury, Gokul Swamy
Published	2025-06-05 17:47:40+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.LG

Power Law Guided Dynamic Sifting for Efficient Attention

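Posted: June 6, 2025 | Author: jarxiv

Summary

Efficient inference on GPUs with large language models remains challenging because of memory bandwidth limitations, particularly during data transfers between High Bandwidth Memory (HBM) and SRAM in attention computations.
Approximate attention methods address this issue by reducing computational and memory overhead, but they often rely on expensive top-$k$ operations, which perform poorly on GPUs.
We propose SiftAttention, a novel approximate attention method that replaces the top-$k$ step with a computationally efficient element-wise filtering operation based on a threshold value.
Our intuition is based on the empirical observation that the $\tau$-th quantile of attention scores follows a predictable power law over sequential generation steps.
Exploiting this insight, our approach dynamically estimates a threshold value per prompt at each generation step.
Only attention scores above this threshold, and their corresponding value vectors, are loaded and used to compute the attention output, which reduces data movement between HBM and SRAM.
Our evaluation shows that SiftAttention preserves model quality better than existing approximate attention methods while reducing memory bandwidth usage when loading value vectors.

Abstract (original)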

Efficient inference on GPUs using large language models remains challenging due to memory bandwidth limitations, particularly during data transfers between High Bandwidth Memory (HBM) and SRAM in attention computations. Approximate attention methods address this issue by reducing computational and memory overhead but often rely on expensive top-$k$ operations, which perform poorly on GPUs. We propose SiftAttention, a novel approximate attention method that replaces the top-$k$ step with a computationally efficient element-wise filtering operation based on a threshold value. Our intuition for doing this is based on our empirical observation that the $\tau$-th quantile of attention scores follows a predictable power-law over sequential generation steps. Exploiting this insight, our approach dynamically estimates a threshold value per prompt at each generation step. Only attention scores above this threshold and their corresponding value vectors are loaded/used to compute the attention output, reducing data movement between HBM and SRAM. Our evaluation demonstrates that SiftAttention preserves model quality better than existing approximate attention methods while reducing memory bandwidth usage when loading value vectors.
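
The two ingredients named in the abstract, an element-wise threshold filter in place of top-$k$ and a power-law model for the threshold, can be sketched on the CPU as follows. This is an illustrative NumPy sketch, not the SiftAttention kernel: whether the surviving scores are renormalized, how the power law is fitted, and all function names are assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def threshold_attention(q, K, V, threshold):
    """Attention output using only keys whose score exceeds `threshold`.

    Assumption: surviving scores are used as-is, without renormalization.
    """
    scores = softmax(K @ q / np.sqrt(q.shape[0]))
    keep = scores > threshold            # element-wise filter, no sorting needed
    return scores[keep] @ V[keep], int(keep.sum())

def fit_power_law(steps, values):
    """Least-squares fit of values ~ a * steps**b in log-log space."""
    b, log_a = np.polyfit(np.log(steps), np.log(values), 1)
    return float(np.exp(log_a)), float(b)

rng = np.random.default_rng(0)
n, d = 64, 16                            # hypothetical cache length and head size
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

full_out, kept_all = threshold_attention(q, K, V, threshold=0.0)
sparse_out, kept_sparse = threshold_attention(q, K, V, threshold=1.0 / n)

# Synthetic quantile trajectory that follows a power law exactly (toy data).
steps = np.arange(1, 51, dtype=float)
a_hat, b_hat = fit_power_law(steps, 0.5 * steps ** -0.8)
print(kept_all, kept_sparse, a_hat, b_hat)
```

The filter touches each score once and needs no sort or selection pass, which is the property the abstract contrasts with top-$k$. In a setting like the one described, the fitted `a_hat * step**b_hat` would supply the threshold for later generation steps instead of a fixed constant.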

arXiv information

Authors	Nirav Koley, Prajwal Singhania, Abhinav Bhatele
Published	2025-06-05 17:50:32+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.LG

Learning normalized image densities via dual score matching

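Posted: June 6, 2025 | Author: jarxiv

Summary

Learning probability models from data is at the heart of many machine learning endeavors, but it is notoriously difficult because of the curse of dimensionality.
We introduce a new framework for learning \emph{normalized} energy (log probability) models, inspired by diffusion generative models, which rely on networks optimized to estimate the score.
We modify a score network architecture to compute an energy while preserving its inductive biases.
The gradient of this energy network with respect to its input image is the score of the learned density, which can be optimized using a denoising objective.
Importantly, the gradient with respect to the noise level provides an additional score that can be optimized with a novel secondary objective, ensuring consistent and normalized energies across noise levels.
We train an energy network with this \emph{dual} score matching objective on the ImageNet64 dataset and obtain a cross-entropy (negative log likelihood) value comparable to the state of the art.
We further validate our approach by showing that our energy model \emph{strongly generalizes}: estimated log probabilities are nearly independent of the specific images in the training set.
Finally, we demonstrate that both image probability and the dimensionality of local neighborhoods vary significantly with image content, in contrast with traditional assumptions such as concentration of measure or support on a low-dimensional manifold.

Abstract (original)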

Learning probability models from data is at the heart of many machine learning endeavors, but is notoriously difficult due to the curse of dimensionality. We introduce a new framework for learning \emph{normalized} energy (log probability) models that is inspired from diffusion generative models, which rely on networks optimized to estimate the score. We modify a score network architecture to compute an energy while preserving its inductive biases. The gradient of this energy network with respect to its input image is the score of the learned density, which can be optimized using a denoising objective. Importantly, the gradient with respect to the noise level provides an additional score that can be optimized with a novel secondary objective, ensuring consistent and normalized energies across noise levels. We train an energy network with this \emph{dual} score matching objective on the ImageNet64 dataset, and obtain a cross-entropy (negative log likelihood) value comparable to the state of the art. We further validate our approach by showing that our energy model \emph{strongly generalizes}: estimated log probabilities are nearly independent of the specific images in the training set. Finally, we demonstrate that both image probability and dimensionality of local neighborhoods vary significantly with image content, in contrast with traditional assumptions such as concentration of measure or support on a low-dimensional manifold.
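
The two scores mentioned in the abstract can be written out under a standard Gaussian-noise model. The notation below is assumed rather than taken from the paper, and only the image-space loss is the familiar denoising score matching objective; the exact form and weighting of the paper's secondary objective are not given in the abstract and are not reproduced here.

```latex
% Assumed notation: clean image x \in \mathbb{R}^d, noisy image
% y = x + \sigma z with z \sim \mathcal{N}(0, I), noisy density p_\sigma,
% and an energy network U_\theta(y, \sigma) \approx -\log p_\sigma(y).
\begin{align}
\nabla_y \log p_\sigma(y)
  &= \mathbb{E}\!\left[ \frac{x - y}{\sigma^2} \,\middle|\, y \right], \\
\partial_\sigma \log p_\sigma(y)
  &= \mathbb{E}\!\left[ \frac{\lVert y - x \rVert^2}{\sigma^3} - \frac{d}{\sigma} \,\middle|\, y \right], \\
\mathcal{L}_{\text{image}}(\theta)
  &= \mathbb{E}_{x, z, \sigma}
     \left\lVert -\nabla_y U_\theta(y, \sigma) - \frac{x - y}{\sigma^2} \right\rVert^2 .
\end{align}
```

Both identities follow from differentiating $p_\sigma(y) = \int p(x)\,\mathcal{N}(y; x, \sigma^2 I)\,dx$ under the integral sign. The second one gives a regression target for $-\partial_\sigma U_\theta(y, \sigma)$ in the same way the first does for $-\nabla_y U_\theta(y, \sigma)$, which is one way to see why a noise-level score can tie the energies at different noise levels together.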

arXiv information

Authors	Florentin Guth, Zahra Kadkhodaie, Eero P Simoncelli
Published	2025-06-05 17:53:57+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.LG
