Learning a Diffusion Model Policy from Rewards via Q-Score Matching

Summary

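Diffusion models have become a popular choice for representing actor policies in behavior cloning and offline reinforcement learning.
This is because they can naturally optimize an expressive class of distributions over a continuous space.
However, previous works fail to exploit the score-based structure of diffusion models and instead train the actor with a simple behavior cloning term, which limits their ability in the actor-critic setting.
This paper presents a theoretical framework that ties the structure of diffusion model policies to a learned Q-function by relating the score of the policy to the action gradient of the Q-function.
The authors focus on off-policy reinforcement learning and derive from this theory a new policy update method, which they call Q-score matching.
Notably, the algorithm only needs to differentiate through the denoising model rather than the entire diffusion model evaluation, and policies converged through Q-score matching are implicitly multi-modal and explorative in continuous domains.
Experiments in simulated environments demonstrate the viability of the proposed method and compare it to popular baselines.
Source code is available from the project website: https://michaelpsenka.io/qsm.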

Summary (original)

Diffusion models have become a popular choice for representing actor policies in behavior cloning and offline reinforcement learning. This is due to their natural ability to optimize an expressive class of distributions over a continuous space. However, previous works fail to exploit the score-based structure of diffusion models, and instead utilize a simple behavior cloning term to train the actor, limiting their ability in the actor-critic setting. In this paper, we present a theoretical framework linking the structure of diffusion model policies to a learned Q-function, by linking the structure between the score of the policy to the action gradient of the Q-function. We focus on off-policy reinforcement learning and propose a new policy update method from this theory, which we denote Q-score matching. Notably, this algorithm only needs to differentiate through the denoising model rather than the entire diffusion model evaluation, and converged policies through Q-score matching are implicitly multi-modal and explorative in continuous domains. We conduct experiments in simulated environments to demonstrate the viability of our proposed method and compare to popular baselines. Source code is available from the project website: https://michaelpsenka.io/qsm.
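
The core idea in the abstract, relating the policy's score to the action gradient of the Q-function, can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the authors' code: the fixed quadratic critic, the scalar linear score model, the squared-error objective, the scale alpha, and all function names are hypothetical. In the paper the score comes from a learned denoising model and the critic is learned as well.

```python
# Toy sketch of a Q-score-matching style objective (illustrative assumptions only).
# The critic is a fixed analytic function here, not a learned Q-network.

def q_action_gradient(action, target):
    """Action gradient of the toy critic Q(a) = -(a - target)^2."""
    return -2.0 * (action - target)

def score(w, b, action):
    """Hypothetical linear score model standing in for the policy's score."""
    return w * action + b

def qsm_loss(w, b, actions, target, alpha=1.0):
    """Mean squared gap between the score and alpha times the action gradient of Q."""
    gaps = [score(w, b, a) - alpha * q_action_gradient(a, target) for a in actions]
    return sum(g * g for g in gaps) / len(actions)

def fit_score(actions, target, alpha=1.0, lr=0.1, steps=500):
    """Plain gradient descent on the loss above, using hand-derived gradients."""
    w, b = 0.0, 0.0
    n = len(actions)
    for _ in range(steps):
        grad_w, grad_b = 0.0, 0.0
        for a in actions:
            gap = score(w, b, a) - alpha * q_action_gradient(a, target)
            grad_w += 2.0 * gap * a / n
            grad_b += 2.0 * gap / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

actions = [-1.0, -0.5, 0.0, 0.5, 1.0]
w, b = fit_score(actions, target=0.5)
print(w, b, qsm_loss(w, b, actions, target=0.5))
```

In this toy setting the fitted score approaches alpha times the critic's action gradient, so following the score moves an action toward the critic's maximizer at target. The sketch only needs the score model's output and the critic's gradient at sampled actions, which loosely mirrors the abstract's remark that the update differentiates through the denoising model rather than the full diffusion rollout.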

arXiv information

Authors	Michael Psenka, Alejandro Escontrela, Pieter Abbeel, Yi Ma
Published	2024-07-16 13:24:36+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
