Hierarchical Programmatic Reinforcement Learning via Learning to Compose Programs

Summary

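Aiming to produce reinforcement learning (RL) policies that are human-interpretable and generalize better to novel scenarios, Trivedi et al. (2021) proposed LEAPS, a method that first learns a program embedding space that continuously parameterizes diverse programs from a pre-generated program dataset and then, given a task, searches that space for a task-solving program.
Despite encouraging results, the program policies that LEAPS can produce are limited by the distribution of the program dataset.
In addition, during search LEAPS evaluates each candidate program solely by its return, so it cannot precisely reward the correct parts of a program or penalize the incorrect parts.
To address these issues, we propose learning a meta-policy that composes a series of programs sampled from the learned program embedding space.
By learning to compose programs, our proposed hierarchical programmatic reinforcement learning (HPRL) framework can produce program policies that describe complex out-of-distribution behaviors and can assign credit directly to the programs that induce desired behaviors.
Experimental results in the Karel domain show that the proposed framework outperforms the baselines.
Ablation studies confirm the limitations of LEAPS and justify our design choices.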

Abstract (original)

Aiming to produce reinforcement learning (RL) policies that are human-interpretable and can generalize better to novel scenarios, Trivedi et al. (2021) present a method (LEAPS) that first learns a program embedding space to continuously parameterize diverse programs from a pre-generated program dataset, and then searches for a task-solving program in the learned program embedding space when given a task. Despite the encouraging results, the program policies that LEAPS can produce are limited by the distribution of the program dataset. Furthermore, during searching, LEAPS evaluates each candidate program solely based on its return, failing to precisely reward correct parts of programs and penalize incorrect parts. To address these issues, we propose to learn a meta-policy that composes a series of programs sampled from the learned program embedding space. By learning to compose programs, our proposed hierarchical programmatic reinforcement learning (HPRL) framework can produce program policies that describe out-of-distributionally complex behaviors and directly assign credits to programs that induce desired behaviors. The experimental results in the Karel domain show that our proposed framework outperforms baselines. The ablation studies confirm the limitations of LEAPS and justify our design choices.
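
The compositional idea in the abstract can be pictured with a small toy sketch. Everything in it is a hypothetical stand-in rather than the authors' implementation: the 1-D walking task, the `decode` function (a placeholder for a learned program decoder), and the sampling-based `meta_policy` (a placeholder for a meta-policy that would actually be trained with RL) are all assumptions made for illustration. What it tries to show is only the structure: a meta-policy picks one latent vector at a time, each latent is decoded into a short program, the programs run back to back, and each program receives its own reward rather than a single return for the whole policy.

```python
import random
from typing import Callable, List, Tuple

GOAL = 7  # hypothetical 1-D task: walk from cell 0 to cell 7

# A "program" here is just a function: state -> (next_state, reward).
Program = Callable[[int], Tuple[int, float]]


def decode(z: float) -> Program:
    """Toy decoder (assumption): latent z in [-3, 3] means 'move round(z) cells'."""
    step = round(z)

    def program(state: int) -> Tuple[int, float]:
        next_state = state + step
        # Reward is the progress this single program made toward the goal.
        reward = float(abs(GOAL - state) - abs(GOAL - next_state))
        return next_state, reward

    return program


def meta_policy(state: int, rng: random.Random, n_candidates: int = 16) -> float:
    """Stand-in for a learned meta-policy: sample latents, keep the best one."""
    candidates = [rng.uniform(-3.0, 3.0) for _ in range(n_candidates)]
    return max(candidates, key=lambda z: decode(z)(state)[1])


def compose(horizon: int = 5, seed: int = 0) -> Tuple[int, List[Tuple[float, float]]]:
    """Run decoded programs back to back, logging (latent, reward) per program."""
    rng = random.Random(seed)
    state = 0
    trace: List[Tuple[float, float]] = []
    for _ in range(horizon):
        z = meta_policy(state, rng)
        state, reward = decode(z)(state)
        trace.append((z, reward))
    return state, trace


final_state, trace = compose()
total_return = sum(reward for _, reward in trace)
```

Because `trace` keeps one reward per program, a learner could credit or penalize each chosen latent separately, whereas scoring only `total_return` would treat the whole composition as a single unit.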

arXiv information

Authors	Guan-Ting Liu, En-Pei Hu, Pu-Jen Cheng, Hung-yi Lee, Shao-Hua Sun
Published	2023-05-31 09:08:07+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
