Stop Regressing: Training Value Functions via Classification for Scalable Deep RL

Summary

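Value functions are a central component of deep reinforcement learning (RL).
These functions, parameterized by neural networks, are trained with a mean squared error regression objective to match bootstrapped target values.
However, scaling value-based RL methods that rely on regression to large networks, such as high-capacity Transformers, has proven difficult.
This difficulty stands in stark contrast to supervised learning, which scales reliably to massive networks by leveraging a cross-entropy classification loss.
Motivated by this discrepancy, the paper investigates whether the scalability of deep RL can also be improved simply by using classification in place of regression to train value functions.
It demonstrates that value functions trained with categorical cross-entropy significantly improve performance and scalability across a variety of domains.
These include single-task RL on Atari 2600 games with SoftMoEs, multi-task RL on Atari with large-scale ResNets, robotic manipulation with Q-transformers, playing chess without search, and a language-agent Wordle task with high-capacity Transformers, with state-of-the-art results achieved in these domains.
Through careful analysis, the authors show that the benefits of categorical cross-entropy stem primarily from its ability to mitigate problems inherent to value-based RL, such as noisy targets and non-stationarity.
Overall, they argue that a simple shift to training value functions with categorical cross-entropy can substantially improve the scalability of deep RL at little to no cost.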
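
The abstract does not say how a scalar bootstrapped target is turned into a classification label. The following is a minimal sketch of one possible approach, assuming a "two-hot" encoding over evenly spaced bins: the target's probability mass is split between the two nearest bin centers, and the network's logits are trained with cross-entropy against that distribution. The names `two_hot`, `cross_entropy`, `expected_value`, `v_min`, `v_max`, and `num_bins` are illustrative assumptions, not taken from the paper, and the encoding shown may differ from the one the authors use.

```python
import math


def two_hot(target, v_min, v_max, num_bins):
    """Encode a scalar target as a distribution over two adjacent bins.

    Assumption: bins are evenly spaced between v_min and v_max.
    """
    # Clip so the target lies inside the supported value range.
    target = min(max(target, v_min), v_max)
    width = (v_max - v_min) / (num_bins - 1)
    pos = (target - v_min) / width  # fractional bin index
    lower = min(int(math.floor(pos)), num_bins - 2)
    upper_weight = pos - lower
    probs = [0.0] * num_bins
    probs[lower] = 1.0 - upper_weight
    probs[lower + 1] = upper_weight
    return probs


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, logits):
    """Categorical cross-entropy between a target distribution and logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(p * (x - log_z) for p, x in zip(target_probs, logits))


def expected_value(logits, v_min, v_max):
    """Recover a scalar value estimate as the mean of the predicted distribution."""
    num_bins = len(logits)
    width = (v_max - v_min) / (num_bins - 1)
    centers = [v_min + i * width for i in range(num_bins)]
    return sum(p * c for p, c in zip(softmax(logits), centers))


# Hypothetical usage: a target of 2.5 on an 11-bin support spanning [0, 10].
target_probs = two_hot(2.5, v_min=0.0, v_max=10.0, num_bins=11)
logits = [0.0] * 11  # stand-in for a network's output
loss = cross_entropy(target_probs, logits)
value = expected_value(logits, v_min=0.0, v_max=10.0)
```

In this sketch the regression loss `(value - target) ** 2` is replaced by `loss`, while a scalar value is still available through `expected_value` wherever the RL algorithm needs one.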

Summary (original)

Value functions are a central component of deep reinforcement learning (RL). These functions, parameterized by neural networks, are trained using a mean squared error regression objective to match bootstrapped target values. However, scaling value-based RL methods that use regression to large networks, such as high-capacity Transformers, has proven challenging. This difficulty is in stark contrast to supervised learning: by leveraging a cross-entropy classification loss, supervised methods have scaled reliably to massive networks. Observing this discrepancy, in this paper, we investigate whether the scalability of deep RL can also be improved simply by using classification in place of regression for training value functions. We demonstrate that value functions trained with categorical cross-entropy significantly improve performance and scalability in a variety of domains. These include: single-task RL on Atari 2600 games with SoftMoEs, multi-task RL on Atari with large-scale ResNets, robotic manipulation with Q-transformers, playing Chess without search, and a language-agent Wordle task with high-capacity Transformers, achieving state-of-the-art results on these domains. Through careful analysis, we show that the benefits of categorical cross-entropy primarily stem from its ability to mitigate issues inherent to value-based RL, such as noisy targets and non-stationarity. Overall, we argue that a simple shift to training value functions with categorical cross-entropy can yield substantial improvements in the scalability of deep RL at little-to-no cost.

arXiv information

Authors	Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taïga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, Aviral Kumar, Rishabh Agarwal
Published	2024-03-06 18:55:47+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
