Policy Synthesis and Reinforcement Learning for Discounted LTL




The difficulty of manually specifying reward functions has led to an interest in using linear temporal logic (LTL) to express objectives for reinforcement learning (RL). However, LTL has the downside that it is sensitive to small perturbations in the transition probabilities, which prevents probably approximately correct (PAC) learning without additional assumptions. Time discounting provides a way of removing this sensitivity, while retaining the high expressivity of the logic. We study the use of discounted LTL for policy synthesis in Markov decision processes with unknown transition probabilities, and show how to reduce discounted LTL to discounted-sum reward via a reward machine when all discount factors are identical.
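To make the reduction mentioned in the abstract concrete, here is a minimal sketch for a single formula, discounted eventually F_λ p. It assumes a semantics that the abstract does not state: F_λ p has value λ^i on a trace, where i is the first step at which p holds, and 0 if p never holds. Under that assumption, a two-state reward machine that pays reward 1 the first time it sees p and 0 otherwise yields the same number as its discounted-sum return with discount factor λ. The function names, the trace encoding (a list of label sets, treated as a finite prefix), and the proposition name "goal" are all hypothetical and chosen only for illustration.

```python
from typing import List, Set


def discounted_eventually(trace: List[Set[str]], prop: str, lam: float) -> float:
    """Assumed value of F_lam prop: lam**i for the first index i where prop holds."""
    for i, labels in enumerate(trace):
        if prop in labels:
            return lam ** i
    return 0.0  # prop never holds on this prefix


def reward_machine_return(trace: List[Set[str]], prop: str, lam: float) -> float:
    """Discounted-sum return of a two-state reward machine for F_lam prop.

    State "wait": prop not seen yet. State "done": prop already seen.
    The machine emits reward 1 on the wait -> done transition and 0 elsewhere.
    """
    state = "wait"
    total = 0.0
    discount = 1.0  # equals lam**t at step t
    for labels in trace:
        reward = 0.0
        if state == "wait" and prop in labels:
            reward = 1.0
            state = "done"
        total += discount * reward
        discount *= lam
    return total


if __name__ == "__main__":
    # Hypothetical trace: "goal" first holds at step 2.
    example = [set(), set(), {"goal"}, {"goal"}]
    print(discounted_eventually(example, "goal", 0.9))
    print(reward_machine_return(example, "goal", 0.9))
```

On the example trace both functions return a value close to 0.81, since the proposition first holds at step 2 and the discount factor is 0.9. Formulas that nest operators need larger machines; this sketch covers only the single-operator case.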


Authors: Rajeev Alur, Osbert Bastani, Kishor Jothimurugan, Mateo Perez, Fabio Somenzi, Ashutosh Trivedi
Published: 2023-05-26 17:32:38+00:00


Categories: cs.LG, cs.LO