Safe MDP Planning by Learning Temporal Patterns of Undesirable Trajectories and Averting Negative Side Effects

Summary

Title: Safe MDP Planning by Learning Temporal Patterns of Undesirable Trajectories and Averting Negative Side Effects

Summary:
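– In safe MDP planning, a cost function based on the current state and action is often used to specify safety aspects.
– In the real world, the state representation in use may lack sufficient fidelity to specify such safety constraints.
– Operating on an incomplete model can often produce unintended negative side effects (NSEs).
– To address these challenges, we first associate safety signals with state-action trajectories rather than with just the immediate state-action pair. This makes our safety model highly general.
– We also assume that categorical safety labels are given for different trajectories, instead of a numerical cost function, which is harder for the problem designer to specify. We then use a supervised learning model to learn such non-Markovian safety patterns.
– Next, we develop a Lagrange multiplier method that incorporates the safety model and the underlying MDP model in a single computation graph to facilitate agent learning of safe behaviors.
– Finally, empirical results on a variety of discrete and continuous domains show that this approach can satisfy complex non-Markovian safety constraints while optimizing the agent's total returns, is highly scalable, and outperforms the previous best approach for Markovian NSEs.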
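
To make the trajectory-level (non-Markovian) labeling idea concrete, here is a minimal sketch. The temporal pattern, the state names, and the `label_trajectory` helper are hypothetical illustrations and do not come from the paper: a trajectory counts as unsafe if the agent enters "lab" after "mud" without passing through "wash" in between. The two trajectories below contain exactly the same state-action pairs, so any cost that sums over individual (state, action) pairs would score them identically, while their categorical labels differ.

```python
from collections import Counter

# Hypothetical temporal pattern (an illustration, not from the paper):
# a trajectory is unsafe if "lab" is entered after "mud" with no "wash" in between.
def label_trajectory(trajectory):
    """Return "unsafe" or "safe" for a list of (state, action) pairs."""
    dirty = False
    for state, _action in trajectory:
        if state == "mud":
            dirty = True
        elif state == "wash":
            dirty = False
        elif state == "lab" and dirty:
            return "unsafe"
    return "safe"

# Same state-action pairs, different order.
traj_a = [("mud", "go"), ("wash", "go"), ("lab", "stay")]
traj_b = [("wash", "go"), ("mud", "go"), ("lab", "stay")]

same_pairs = Counter(traj_a) == Counter(traj_b)
labels = (label_trajectory(traj_a), label_trajectory(traj_b))
print(same_pairs, labels)
```

In the paper's setting such labels are supplied for example trajectories and a supervised model learns the pattern; the hand-written rule here only stands in for that labeling source.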

Abstract (original)

In safe MDP planning, a cost function based on the current state and action is often used to specify safety aspects. In the real world, often the state representation used may lack sufficient fidelity to specify such safety constraints. Operating based on an incomplete model can often produce unintended negative side effects (NSEs). To address these challenges, first, we associate safety signals with state-action trajectories (rather than just an immediate state-action). This makes our safety model highly general. We also assume categorical safety labels are given for different trajectories, rather than a numerical cost function, which is harder to specify by the problem designer. We then employ a supervised learning model to learn such non-Markovian safety patterns. Second, we develop a Lagrange multiplier method, which incorporates the safety model and the underlying MDP model in a single computation graph to facilitate agent learning of safe behaviors. Finally, our empirical results on a variety of discrete and continuous domains show that this approach can satisfy complex non-Markovian safety constraints while optimizing an agent’s total returns, is highly scalable, and is also better than the previous best approach for Markovian NSEs.
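
The Lagrange multiplier idea in the abstract can be illustrated in its generic textbook form. The sketch below is not the paper's single-computation-graph method: it assumes a small, hypothetical set of candidate policies with made-up expected returns and made-up probabilities of producing an unsafe trajectory, and it runs plain dual ascent on L(pi, lam) = return(pi) - lam * (p_unsafe(pi) - eps). The names `POLICIES`, `lagrangian_search`, `eps`, and `lr` are all assumptions introduced for this example.

```python
# Hypothetical candidates: name -> (expected return, probability of an unsafe trajectory).
# All numbers are made up for illustration.
POLICIES = {
    "fast": (10.0, 0.60),
    "balanced": (8.0, 0.20),
    "cautious": (5.0, 0.00),
}

def lagrangian_search(policies, eps=0.25, lr=1.0, steps=50):
    """Dual ascent on L(pi, lam) = return(pi) - lam * (p_unsafe(pi) - eps)."""
    lam = 0.0
    best_name, best_return = None, float("-inf")
    for _ in range(steps):
        # Primal step: pick the policy that maximizes the Lagrangian for the current multiplier.
        name = max(policies, key=lambda n: policies[n][0] - lam * (policies[n][1] - eps))
        ret, p_unsafe = policies[name]
        # Remember the best policy seen so far that actually satisfies the constraint.
        if p_unsafe <= eps and ret > best_return:
            best_name, best_return = name, ret
        # Dual step: raise lam when the constraint is violated, lower it otherwise (never below 0).
        lam = max(0.0, lam + lr * (p_unsafe - eps))
    return best_name, lam

chosen, final_lam = lagrangian_search(POLICIES)
print(chosen)
```

With these numbers the multiplier grows while the high-return but unsafe "fast" policy is selected, until the penalty makes "balanced" preferable. Because the candidates are discrete, the multiplier keeps oscillating around the switching point, which is why the sketch returns the best feasible policy it visited rather than the last one.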

arXiv information

Authors	Siow Meng Low, Akshat Kumar, Scott Sanner
Published	2023-04-06 14:03:24+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, OpenAI
