Scaling Learning based Policy Optimization for Temporal Logic Tasks by Controller Network Dropout

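Summary

This paper presents a model-based approach for training feedback controllers for an autonomous agent operating in a highly nonlinear (though deterministic) environment.
We require the trained policy to ensure that the agent satisfies specific task objectives and safety constraints, both expressed in Discrete-Time Signal Temporal Logic (DT-STL).
One advantage of reformulating a task in a formal framework such as DT-STL is that it admits quantitative satisfaction semantics.
In other words, given a trajectory and a DT-STL formula, we can compute the robustness, which can be interpreted as an approximate signed distance between the trajectory and the set of trajectories that satisfy the formula.
We use feedback control and assume a feedforward neural network as the feedback controller to be learned.
We show how this learning problem resembles training a recurrent neural network (RNN) in which the number of recurrent units is proportional to the temporal horizon of the agent's task objectives.
This poses a challenge: RNNs are susceptible to vanishing and exploding gradients, so naive gradient-descent strategies for long-horizon task objectives suffer from the same problems.
To address this challenge, we introduce a novel gradient approximation algorithm based on the idea of dropout, or gradient sampling.
One of the main contributions is the notion of controller network dropout: at several time steps in the task horizon, the NN controller is approximated by the control input obtained from the controller in a previous training step.
We show that this control synthesis methodology helps stochastic gradient descent converge with fewer numerical issues, enabling scalable backpropagation over long time horizons and over trajectories in high-dimensional state spaces.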
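
To make the notion of robustness concrete, the following is a minimal sketch of the standard min/max quantitative semantics for two temporal operators on a discrete-time trajectory. It is an illustration only, not the paper's implementation: the function names, the margin-function representation of predicates, and the 1-D example are all assumptions introduced here.

```python
from typing import Callable, Sequence

State = Sequence[float]
# A predicate is given as a signed margin: positive when it holds, negative when violated.
Margin = Callable[[State], float]


def always(traj: Sequence[State], margin: Margin, start: int, end: int) -> float:
    """Robustness of 'margin > 0 at every step in [start, end]': the worst margin."""
    return min(margin(x) for x in traj[start:end + 1])


def eventually(traj: Sequence[State], margin: Margin, start: int, end: int) -> float:
    """Robustness of 'margin > 0 at some step in [start, end]': the best margin."""
    return max(margin(x) for x in traj[start:end + 1])


def reach_avoid(traj: Sequence[State], goal: Margin, safe: Margin) -> float:
    """Robustness of (eventually goal) AND (always safe); conjunction is a min."""
    last = len(traj) - 1
    return min(eventually(traj, goal, 0, last), always(traj, safe, 0, last))


# Hypothetical 1-D example: reach x >= 2.5 while keeping x <= 4.
goal = lambda x: x[0] - 2.5
safe = lambda x: 4.0 - x[0]
good = [(0.0,), (1.0,), (2.0,), (3.0,)]
bad = [(0.0,), (1.0,), (2.0,), (5.0,)]
rho_good = reach_avoid(good, goal, safe)  # positive: the formula is satisfied
rho_bad = reach_avoid(bad, goal, safe)    # negative: the safety bound is violated
```

The sign of the returned value indicates satisfaction or violation, and its magnitude indicates by how much, which is what makes the quantity usable as a training objective.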

Summary (original)

This paper introduces a model-based approach for training feedback controllers for an autonomous agent operating in a highly nonlinear (albeit deterministic) environment. We desire the trained policy to ensure that the agent satisfies specific task objectives and safety constraints, both expressed in Discrete-Time Signal Temporal Logic (DT-STL). One advantage for reformulation of a task via formal frameworks, like DT-STL, is that it permits quantitative satisfaction semantics. In other words, given a trajectory and a DT-STL formula, we can compute the robustness, which can be interpreted as an approximate signed distance between the trajectory and the set of trajectories satisfying the formula. We utilize feedback control, and we assume a feed forward neural network for learning the feedback controller. We show how this learning problem is similar to training recurrent neural networks (RNNs), where the number of recurrent units is proportional to the temporal horizon of the agent’s task objectives. This poses a challenge: RNNs are susceptible to vanishing and exploding gradients, and naïve gradient descent-based strategies to solve long-horizon task objectives thus suffer from the same problems. To tackle this challenge, we introduce a novel gradient approximation algorithm based on the idea of dropout or gradient sampling. One of the main contributions is the notion of controller network dropout, where we approximate the NN controller in several time-steps in the task horizon by the control input obtained using the controller in a previous training step. We show that our control synthesis methodology, can be quite helpful for stochastic gradient descent to converge with less numerical issues, enabling scalable backpropagation over long time horizons and trajectories over high dimensional state spaces.
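
The controller network dropout idea can be sketched on a toy problem. The code below is one reading of the abstract, not the authors' algorithm: it assumes a hypothetical scalar system `x[t+1] = A*x[t] + B*u[t]`, a one-parameter controller `u = k*x` in place of a neural network, and a hand-written forward-mode derivative in place of backpropagation. At "kept" steps the controller is evaluated and the derivative with respect to `k` flows through it; at "dropped" steps the control stored from a previous rollout is reused as a constant.

```python
import random
from typing import List, Optional, Sequence, Tuple

A, B = 1.0, 0.1  # hypothetical scalar dynamics: x[t+1] = A*x[t] + B*u[t]


def sample_keep_mask(horizon: int, n_keep: int, rng: random.Random) -> List[bool]:
    """Pick n_keep time steps at which the controller is actually evaluated."""
    kept = set(rng.sample(range(horizon), n_keep))
    return [t in kept for t in range(horizon)]


def rollout(
    k: float,
    x0: float,
    horizon: int,
    prev_controls: Optional[Sequence[float]] = None,
    keep: Optional[Sequence[bool]] = None,
) -> Tuple[float, float, List[float]]:
    """Return (final state, d(final state)/dk, applied controls)."""
    x, dx = x0, 0.0  # dx tracks the derivative of x with respect to k
    controls: List[float] = []
    for t in range(horizon):
        if prev_controls is None or keep is None or keep[t]:
            # Kept step: evaluate the controller, derivative flows through it.
            u, du = k * x, x + k * dx
        else:
            # Dropped step: reuse the stored control as a constant (du = 0).
            u, du = prev_controls[t], 0.0
        controls.append(u)
        x, dx = A * x + B * u, A * dx + B * du
    return x, dx, controls


k, x0, T = -1.0, 1.0, 5
x_full, g_full, u_prev = rollout(k, x0, T)           # full unrolled gradient
keep = [True, False, True, False, True]              # fixed mask for illustration
x_drop, g_drop, _ = rollout(k, x0, T, u_prev, keep)  # approximate gradient
```

In this sketch the forward trajectory is unchanged when the stored controls come from the same parameter value; only the derivative is approximated, since fewer controller evaluations sit on the differentiation path. In an automatic differentiation framework the dropped steps would correspond to treating the stored control as a constant.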

arXiv information

Authors	Navid Hashemi, Bardh Hoxha, Danil Prokhorov, Georgios Fainekos, Jyotirmoy Deshmukh
Published	2024-08-27 22:18:07+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
