The Role of Inherent Bellman Error in Offline Reinforcement Learning with Linear Function Approximation

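Summary

This paper studies the offline RL problem with linear function approximation.
The main structural assumption is that the MDP has low inherent Bellman error, which stipulates that linear value functions have linear Bellman backups with respect to the greedy policy.
This assumption is natural in that it is essentially the minimal assumption required for value iteration to succeed.
The authors give a computationally efficient algorithm that succeeds under a single-policy coverage condition on the dataset: it outputs a policy whose value is at least that of any policy that is well covered by the dataset.
Even when the inherent Bellman error is 0 (a setting termed linear Bellman completeness), the algorithm yields the first known guarantee under single-policy coverage.
When the inherent Bellman error is positive, ${\varepsilon_{\mathrm{BE}}} > 0$, the suboptimality error of the algorithm scales with $\sqrt{\varepsilon_{\mathrm{BE}}}$.
Furthermore, the authors prove that this $\sqrt{\varepsilon_{\mathrm{BE}}}$ scaling of the suboptimality cannot be improved by any algorithm.
This lower bound stands in contrast to many other settings in reinforcement learning with misspecification, where performance typically degrades only linearly with the misspecification error.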
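
The summary does not spell out the quantity ${\varepsilon_{\mathrm{BE}}}$. As a reading aid, one commonly used finite-horizon formalization of inherent Bellman error is sketched below. The notation (horizon $H$, step index $h$, features $\phi_h$, parameter sets $\mathcal{B}_h$, backup operator $\mathcal{T}_h$) is an assumption made here for illustration and is not taken from the abstract; the paper's own definition may differ in details.

```latex
% Assumed notation: features \phi_h(s,a) \in \mathbb{R}^d and parameter sets
% \mathcal{B}_h \subseteq \mathbb{R}^d for each step h = 1, \dots, H.
% Bellman backup of the linear value function with parameter \theta_{h+1},
% taken with respect to the greedy policy (the max over a'):
\[
(\mathcal{T}_h \theta_{h+1})(s,a)
  = r_h(s,a)
  + \mathbb{E}_{s' \sim P_h(\cdot \mid s,a)}
    \Big[ \max_{a'} \langle \phi_{h+1}(s',a'), \theta_{h+1} \rangle \Big].
\]
% Inherent Bellman error: the worst-case distance from such a backup
% to the closest linear function, measured in sup norm over (s,a).
\[
\varepsilon_{\mathrm{BE}}
  = \max_{h} \; \sup_{\theta_{h+1} \in \mathcal{B}_{h+1}} \;
    \inf_{\theta_h \in \mathcal{B}_h} \; \sup_{(s,a)}
    \Big| \langle \phi_h(s,a), \theta_h \rangle
          - (\mathcal{T}_h \theta_{h+1})(s,a) \Big|.
\]
```

Under this reading, ${\varepsilon_{\mathrm{BE}}} = 0$ means every such backup is exactly linear in the features, which matches the linear Bellman completeness setting named in the summary.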

Abstract (original)

In this paper, we study the offline RL problem with linear function approximation. Our main structural assumption is that the MDP has low inherent Bellman error, which stipulates that linear value functions have linear Bellman backups with respect to the greedy policy. This assumption is natural in that it is essentially the minimal assumption required for value iteration to succeed. We give a computationally efficient algorithm which succeeds under a single-policy coverage condition on the dataset, namely which outputs a policy whose value is at least that of any policy which is well-covered by the dataset. Even in the setting when the inherent Bellman error is 0 (termed linear Bellman completeness), our algorithm yields the first known guarantee under single-policy coverage. In the setting of positive inherent Bellman error ${\varepsilon_{\mathrm{BE}}} > 0$, we show that the suboptimality error of our algorithm scales with $\sqrt{\varepsilon_{\mathrm{BE}}}$. Furthermore, we prove that the scaling of the suboptimality with $\sqrt{\varepsilon_{\mathrm{BE}}}$ cannot be improved for any algorithm. Our lower bound stands in contrast to many other settings in reinforcement learning with misspecification, where one can typically obtain performance that degrades linearly with the misspecification error.
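
The structural assumption in the abstract, that linear value functions have (nearly) linear Bellman backups under the greedy policy, can be probed numerically on a toy problem. The following is a minimal sketch, not the paper's algorithm: the function name `bellman_backup_residual`, the array layout, and the undiscounted one-step backup are all assumptions made for illustration. It fixes one next-step parameter vector, applies the greedy backup, fits the result with least squares, and reports the largest absolute fitting error. Because least squares minimizes squared error rather than sup-norm error, and because only a single parameter vector is tried, the reported number is only a rough proxy for the inherent Bellman error, not its exact value.

```python
import numpy as np


def bellman_backup_residual(phi, reward, trans, theta_next):
    """One-step greedy Bellman backup of a linear Q-function, then a linear fit.

    Hypothetical helper for illustration only.
    phi:        (S, A, d) feature array, phi[s, a] is the feature vector
    reward:     (S, A) expected immediate rewards
    trans:      (S, A, S) transition probabilities, trans[s, a, s']
    theta_next: (d,) parameter of the next-step linear Q-function
    Returns the least-squares parameter and the max absolute residual.
    """
    num_states, num_actions, dim = phi.shape
    q_next = phi @ theta_next               # (S, A): next-step Q-values
    v_next = q_next.max(axis=1)             # (S,): greedy value per state
    target = reward + trans @ v_next        # (S, A): undiscounted backup
    features = phi.reshape(num_states * num_actions, dim)
    targets = target.reshape(num_states * num_actions)
    theta, *_ = np.linalg.lstsq(features, targets, rcond=None)
    residual = float(np.max(np.abs(features @ theta - targets)))
    return theta, residual


# Toy MDP: 2 states, 2 actions, uniform transitions, reward depends on action.
reward = np.array([[0.0, 1.0], [0.0, 1.0]])
trans = np.full((2, 2, 2), 0.5)

# One-hot (tabular) features can represent any backup exactly.
tabular_phi = np.eye(4).reshape(2, 2, 4)
_, tabular_residual = bellman_backup_residual(
    tabular_phi, reward, trans, np.array([0.3, -1.0, 2.0, 0.5])
)

# A single constant feature cannot separate the two actions.
constant_phi = np.ones((2, 2, 1))
_, constant_residual = bellman_backup_residual(
    constant_phi, reward, trans, np.zeros(1)
)

print("tabular features residual:", tabular_residual)
print("constant feature residual:", constant_residual)
```

With one-hot features the residual is zero up to floating-point error, which corresponds to the Bellman-complete case. With the single constant feature the backup cannot be represented, so the residual is strictly positive, a toy analogue of the ${\varepsilon_{\mathrm{BE}}} > 0$ regime.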

arXiv information

Authors	Noah Golowich, Ankur Moitra
Published	2024-06-17 16:04:06+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
