Value-Based Deep RL Scales Predictably

Summary

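Scaling data and compute is critical to the success of machine learning.
However, scaling demands predictability: we want methods not only to perform well with more compute or data, but also to have their performance be predictable from small-scale runs, without running the large-scale experiment.
In this paper, we show that value-based off-policy RL methods are predictable despite community lore regarding their pathological behavior.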
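First, we show that the data and compute requirements to attain a given performance level lie on a Pareto frontier, controlled by the updates-to-data (UTD) ratio.
By estimating this frontier, we can predict the data requirement when given more compute, and the compute requirement when given more data.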
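The frontier idea can be illustrated with a minimal sketch. It assumes, purely for illustration, that the data needed to reach a fixed performance threshold falls as a pure power law in the UTD ratio σ, D(σ) = a·σ^(-b), and that compute is proportional to the number of gradient updates, C(σ) = c·σ·D(σ). Neither form is taken from the paper, whose fitted frontier may differ, and the function names and synthetic measurements are hypothetical. Fitting a and b on small runs and inverting C(σ) then predicts the UTD ratio and data requirement at a larger compute budget.

```python
import math

# Hypothetical model (an assumption, not the paper's fitted form):
#   data to reach a target return:  D(utd) = a * utd**(-b)
#   compute for that run:           C(utd) = cost_per_update * utd * D(utd)

def fit_power_law(xs, ys):
    """Ordinary least squares on log(y) = log(a) - b*log(x); returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    slope = sxy / sxx
    return math.exp(my - slope * mx), -slope

def data_needed(utd, a, b):
    """Environment steps predicted to reach the target at this UTD ratio."""
    return a * utd ** (-b)

def compute_needed(utd, a, b, cost_per_update=1.0):
    """Compute predicted for that run: one cost unit per gradient update."""
    return cost_per_update * utd * data_needed(utd, a, b)

def utd_for_compute(budget, a, b, cost_per_update=1.0):
    """Invert C(utd) = cost * a * utd**(1 - b) for a given compute budget."""
    if not 0.0 < b < 1.0:
        raise ValueError("inversion assumes 0 < b < 1, so compute grows with UTD")
    return (budget / (cost_per_update * a)) ** (1.0 / (1.0 - b))

# Synthetic small-scale measurements, generated from a=1e6, b=0.5 for illustration.
small_utds = [1, 2, 4, 8]
small_data = [1e6 * u ** -0.5 for u in small_utds]
a, b = fit_power_law(small_utds, small_data)

# Extrapolate: which UTD ratio, and how much data, fit a 4e6 compute budget?
utd_big = utd_for_compute(4e6, a, b)
data_big = data_needed(utd_big, a, b)
```

With these synthetic points the fit recovers a ≈ 1e6 and b ≈ 0.5, so a 4e6 compute budget maps to a UTD ratio of about 16 and roughly 2.5e5 environment steps: more compute buys a lower data requirement, which is the trade-off the frontier describes.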
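Second, we determine the optimal allocation of a total resource budget across data and compute for a given performance, and use it to determine hyperparameters that maximize performance for a given budget.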
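The budget-allocation step can be sketched in the same spirit. As a labeled assumption, take a total budget that prices data relative to compute, F(σ) = C(σ) + δ·D(σ), where δ is a hypothetical cost of one environment step in compute units, and reuse the illustrative power-law frontier D(σ) = a·σ^(-b), C(σ) = c·σ·D(σ). Setting dF/dσ = 0 gives σ* = δb / (c(1 - b)), which the code checks against a grid search. This is not the paper's exact objective or fit.

```python
# Illustrative assumptions (not the paper's fitted model):
#   D(utd) = a * utd**(-b)               data to reach the target return
#   C(utd) = cost_per_update * utd * D   compute for that run
#   F(utd) = C(utd) + delta * D(utd)     total budget; delta = price of one env step

def total_budget(utd, a, b, delta, cost_per_update=1.0):
    data = a * utd ** (-b)
    compute = cost_per_update * utd * data
    return compute + delta * data

def optimal_utd_closed_form(b, delta, cost_per_update=1.0):
    """Solve dF/d(utd) = 0, valid for 0 < b < 1: utd* = delta*b / (cost*(1-b))."""
    if not 0.0 < b < 1.0:
        raise ValueError("closed form assumes 0 < b < 1")
    return delta * b / (cost_per_update * (1.0 - b))

def optimal_utd_grid(a, b, delta, grid, cost_per_update=1.0):
    """Brute-force check: the grid point with the smallest total budget."""
    return min(grid, key=lambda u: total_budget(u, a, b, delta, cost_per_update))

a, b, delta = 1e6, 0.5, 4.0              # synthetic values for illustration
grid = [0.5 * k for k in range(1, 41)]   # candidate UTD ratios 0.5, 1.0, ..., 20.0
utd_star = optimal_utd_closed_form(b, delta)
utd_grid = optimal_utd_grid(a, b, delta, grid)
```

With δ = 4 and b = 0.5 both methods pick a UTD ratio of 4. A larger δ (more expensive data) moves the optimum toward a higher UTD ratio, that is, toward spending more compute per collected sample.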
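Third, this scaling behavior is enabled by first estimating predictable relationships between hyperparameters, which are used to manage the effects of overfitting and plasticity loss unique to RL.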
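We validate our approach using three algorithms, SAC, BRO, and PQL, on DeepMind Control, OpenAI Gym, and IsaacGym, when extrapolating to higher levels of data, compute, budget, or performance.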

Summary (Original)

Scaling data and compute is critical to the success of machine learning. However, scaling demands predictability: we want methods to not only perform well with more compute or data, but also have their performance be predictable from small-scale runs, without running the large-scale experiment. In this paper, we show that value-based off-policy RL methods are predictable despite community lore regarding their pathological behavior. First, we show that data and compute requirements to attain a given performance level lie on a Pareto frontier, controlled by the updates-to-data (UTD) ratio. By estimating this frontier, we can predict this data requirement when given more compute, and this compute requirement when given more data. Second, we determine the optimal allocation of a total resource budget across data and compute for a given performance and use it to determine hyperparameters that maximize performance for a given budget. Third, this scaling behavior is enabled by first estimating predictable relationships between hyperparameters, which is used to manage effects of overfitting and plasticity loss unique to RL. We validate our approach using three algorithms: SAC, BRO, and PQL on DeepMind Control, OpenAI gym, and IsaacGym, when extrapolating to higher levels of data, compute, budget, or performance.

arXiv information

Authors	Oleh Rybkin, Michal Nauman, Preston Fu, Charlie Snell, Pieter Abbeel, Sergey Levine, Aviral Kumar
Published	2025-02-06 18:59:47+00:00
arXiv site	arxiv_id(pdf)

Provider, services used

arxiv.jp, Google
