Inverse Q-Learning Done Right: Offline Imitation Learning in $Q^\pi$-Realizable MDPs

Summary

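We study the problem of offline imitation learning in Markov decision processes (MDPs), where the goal is to learn a well-performing policy given a dataset of state-action pairs generated by an expert policy.
Complementing a recent line of work on this topic that assumes the expert belongs to a tractable class of known policies, we approach this problem from a new angle and leverage a different type of structural assumption about the environment.
Specifically, for the class of linear $Q^\pi$-realizable MDPs, we introduce a new algorithm called saddle-point offline imitation learning (SPOIL), which is guaranteed to match the performance of any expert up to an additive error $\varepsilon$ with access to $\mathcal{O}(\varepsilon^{-2})$ samples.
Moreover, we extend this result to possibly non-linear $Q^\pi$-realizable MDPs at the cost of a worse sample complexity of order $\mathcal{O}(\varepsilon^{-4})$.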
Finally, our analysis suggests a new loss function for training critic networks from expert data in deep imitation learning.
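Empirical evaluations on standard benchmarks demonstrate that the neural net implementation of SPOIL is superior to behavior cloning and competitive with state-of-the-art algorithms.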

Abstract (original)

We study the problem of offline imitation learning in Markov decision processes (MDPs), where the goal is to learn a well-performing policy given a dataset of state-action pairs generated by an expert policy. Complementing a recent line of work on this topic that assumes the expert belongs to a tractable class of known policies, we approach this problem from a new angle and leverage a different type of structural assumption about the environment. Specifically, for the class of linear $Q^\pi$-realizable MDPs, we introduce a new algorithm called saddle-point offline imitation learning (SPOIL), which is guaranteed to match the performance of any expert up to an additive error $\varepsilon$ with access to $\mathcal{O}(\varepsilon^{-2})$ samples. Moreover, we extend this result to possibly non-linear $Q^\pi$-realizable MDPs at the cost of a worse sample complexity of order $\mathcal{O}(\varepsilon^{-4})$. Finally, our analysis suggests a new loss function for training critic networks from expert data in deep imitation learning. Empirical evaluations on standard benchmarks demonstrate that the neural net implementation of SPOIL is superior to behavior cloning and competitive with state-of-the-art algorithms.
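To make the gap between the two stated rates concrete, the following is a minimal sketch of how the required number of expert samples would scale as the target error shrinks. It assumes the bounds behave exactly like $\varepsilon^{-2}$ and $\varepsilon^{-4}$, ignoring constants, logarithmic factors, and any dependence on other problem parameters, none of which the abstract specifies. The helper name `relative_samples` is hypothetical and is not part of the paper or of any SPOIL implementation.

```python
def relative_samples(eps: float, eps_ref: float, exponent: int) -> float:
    """Growth factor of an eps**(-exponent) sample bound when the target
    error moves from eps_ref to eps. Constants cancel in the ratio, so only
    the exponent matters (an idealization of the stated O(.) rates)."""
    if eps <= 0 or eps_ref <= 0:
        raise ValueError("target errors must be positive")
    return (eps_ref / eps) ** exponent


# Halving the target error from 0.1 to 0.05:
linear_growth = relative_samples(0.05, 0.1, exponent=2)     # linear Q^pi-realizable rate
nonlinear_growth = relative_samples(0.05, 0.1, exponent=4)  # non-linear rate

print(f"linear case: about {linear_growth:.0f}x more samples")
print(f"non-linear case: about {nonlinear_growth:.0f}x more samples")
```

Under these assumptions, halving $\varepsilon$ costs roughly 4 times as many samples in the linear setting and roughly 16 times as many in the non-linear one.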

arXiv information

Authors	Antoine Moulin, Gergely Neu, Luca Viano
Published	2025-06-02 13:30:51+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
