Continual Deep Reinforcement Learning with Task-Agnostic Policy Distillation

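Abstract

Central to the development of universal learning systems is the ability to solve multiple tasks without retraining from scratch when new data arrives.
This matters because each task requires significant training time.
The complexity of the continual learning problem space means that addressing it requires a variety of methods.
This problem space includes (1) addressing catastrophic forgetting so that previously learned tasks are retained, (2) demonstrating positive forward transfer for faster learning, (3) ensuring scalability across a large number of tasks, and (4) facilitating learning without requiring task labels, even when no clear task boundaries exist.
This paper introduces the Task-Agnostic Policy Distillation (TAPD) framework.
The framework alleviates problems (1) to (4) by incorporating a task-agnostic phase, in which the agent explores its environment without any external goal and maximizes only its intrinsic motivation.
The knowledge gained during this phase is later distilled for further exploration.
The agent therefore acts in a self-supervised manner by systematically seeking novel states.
By using the task-agnostic distilled knowledge, the agent can solve downstream tasks more efficiently, which improves sample efficiency.
The code is available in the repository https://github.com/wabbajack1/TAPD.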

Abstract (original)

Central to the development of universal learning systems is the ability to solve multiple tasks without retraining from scratch when new data arrives. This is crucial because each task requires significant training time. Addressing the problem of continual learning necessitates various methods due to the complexity of the problem space. This problem space includes: (1) addressing catastrophic forgetting to retain previously learned tasks, (2) demonstrating positive forward transfer for faster learning, (3) ensuring scalability across numerous tasks, and (4) facilitating learning without requiring task labels, even in the absence of clear task boundaries. In this paper, the Task-Agnostic Policy Distillation (TAPD) framework is introduced. This framework alleviates problems (1)-(4) by incorporating a task-agnostic phase, where an agent explores its environment without any external goal and maximizes only its intrinsic motivation. The knowledge gained during this phase is later distilled for further exploration. Therefore, the agent acts in a self-supervised manner by systematically seeking novel states. By utilizing task-agnostic distilled knowledge, the agent can solve downstream tasks more efficiently, leading to improved sample efficiency. Our code is available at the repository: https://github.com/wabbajack1/TAPD.
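
The abstract names two ingredients, an intrinsic reward that drives task-agnostic exploration and a distillation step that transfers the resulting policy, without giving their exact form. The sketch below is only an illustration under stated assumptions: it uses squared prediction error as the novelty signal and a KL divergence between teacher and student action distributions as the distillation loss, both common choices and not confirmed as what TAPD uses. The function names `intrinsic_reward` and `distillation_loss` are hypothetical and do not come from the TAPD repository.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn action logits into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def intrinsic_reward(predicted_next_state, actual_next_state):
    """Assumed novelty signal: squared error of a forward-model prediction.

    States the agent predicts poorly count as novel and earn more reward.
    """
    return sum(
        (p - a) ** 2 for p, a in zip(predicted_next_state, actual_next_state)
    )


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Assumed distillation objective: KL(teacher || student) for one state."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    # A badly predicted transition yields a larger exploration bonus.
    print(intrinsic_reward([0.0, 0.0], [3.0, 4.0]))
    # The loss shrinks as the student's action preferences approach the teacher's.
    print(distillation_loss([2.0, 0.5, 0.1], [0.1, 0.5, 2.0]))
    print(distillation_loss([2.0, 0.5, 0.1], [1.8, 0.6, 0.1]))
```

In a setup like this, the exploring agent would be trained on `intrinsic_reward` alone during the task-agnostic phase, and `distillation_loss` would be minimized over states it visited so that a second network inherits its behavior before downstream training begins.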

arXiv information

Authors	Muhammad Burhan Hafez, Kerim Erekmen
Published	2024-11-25 16:18:39+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
