Extreme Q-Learning: MaxEnt RL without Entropy

Summary

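Modern deep reinforcement learning (RL) algorithms need an estimate of the maximal Q-value, which is hard to compute in continuous domains where the set of possible actions is infinite.
Drawing inspiration from economics, this work introduces a new update rule for online and offline RL that models the maximal value directly using Extreme Value Theory (EVT).
Doing so avoids computing Q-values on out-of-distribution actions, which is often a substantial source of error.
Our key insight is an objective that directly estimates the optimal soft value function (LogSumExp) in the maximum-entropy RL setting, without needing to sample from a policy.
Using EVT, we derive the Extreme Q-Learning framework and, from it, online and (for the first time) offline MaxEnt Q-learning algorithms that do not require explicit access to a policy or its entropy.
Our method achieves consistently strong performance on the D4RL benchmark, outperforming prior work by 10+ points on the challenging Franka Kitchen tasks, and offers moderate improvements over SAC and TD3 on online DM Control tasks.
Visualizations and code are available on the project website at https://div99.github.io/XQL/.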
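
To make the LogSumExp idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the soft value is beta * log(mean(exp(Q / beta))) over a handful of sampled Q-values, and it assumes a Gumbel-regression-style loss, mean(exp(z) - z - 1) with z = (Q - v) / beta, as the objective that estimates it. The temperature `beta`, the sample values, and all function names are hypothetical. Bisection is used only because this toy problem is one-dimensional; it is not how a value network would be trained.

```python
import math

def soft_value(q_values, beta):
    """Closed form: beta * log(mean(exp(q / beta))), computed stably."""
    m = max(q_values)  # subtract the max so exp() cannot overflow
    mean_exp = sum(math.exp((q - m) / beta) for q in q_values) / len(q_values)
    return m + beta * math.log(mean_exp)

def gumbel_loss(v, q_values, beta):
    """Mean of exp(z) - z - 1 with z = (q - v) / beta (assumed objective)."""
    total = 0.0
    for q in q_values:
        z = (q - v) / beta
        total += math.exp(z) - z - 1.0
    return total / len(q_values)

def gumbel_loss_grad(v, q_values, beta):
    """Derivative of gumbel_loss with respect to v."""
    n = len(q_values)
    return sum(1.0 - math.exp((q - v) / beta) for q in q_values) / (beta * n)

def fit_value(q_values, beta, iters=100):
    """Minimise the convex loss by bisection on the sign of its derivative."""
    lo, hi = min(q_values), max(q_values)  # the minimiser lies in this interval
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gumbel_loss_grad(mid, q_values, beta) > 0.0:
            hi = mid  # loss is increasing here, so the minimum is to the left
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Hypothetical Q-values for five sampled actions in one state.
q_samples = [1.0, 2.0, 0.5, 3.0, 2.5]
for beta in (2.0, 0.5, 0.1):
    print(beta, soft_value(q_samples, beta), fit_value(q_samples, beta))
```

In this sketch the value found by minimising the loss matches the closed-form soft value, and no policy is sampled at any point. As `beta` shrinks, the estimate moves toward the largest sampled Q-value, which is the sense in which the objective models the maximum directly.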

Summary (original)

Modern Deep Reinforcement Learning (RL) algorithms require estimates of the maximal Q-value, which are difficult to compute in continuous domains with an infinite number of possible actions. In this work, we introduce a new update rule for online and offline RL which directly models the maximal value using Extreme Value Theory (EVT), drawing inspiration from economics. By doing so, we avoid computing Q-values using out-of-distribution actions which is often a substantial source of error. Our key insight is to introduce an objective that directly estimates the optimal soft-value functions (LogSumExp) in the maximum entropy RL setting without needing to sample from a policy. Using EVT, we derive our \emph{Extreme Q-Learning} framework and consequently online and, for the first time, offline MaxEnt Q-learning algorithms, that do not explicitly require access to a policy or its entropy. Our method obtains consistently strong performance in the D4RL benchmark, outperforming prior works by \emph{10+ points} on the challenging Franka Kitchen tasks while offering moderate improvements over SAC and TD3 on online DM Control tasks. Visualizations and code can be found on our website at https://div99.github.io/XQL/.

arXiv information

Authors	Divyansh Garg, Joey Hejna, Matthieu Geist, Stefano Ermon
Published	2023-02-28 22:14:18+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
