Surprise-Adaptive Intrinsic Motivation for Unsupervised Reinforcement Learning

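Abstract

Entropy-minimizing and entropy-maximizing (curiosity) objectives for unsupervised reinforcement learning (RL) have each been shown to be effective in different environments, depending on the environment's natural level of entropy.
However, neither method on its own produces an agent that consistently learns intelligent behavior across environments.
To find a single entropy-based method that encourages emergent behavior in any environment, we propose an agent that adapts its objective online according to the entropy conditions, by framing the choice of objective as a multi-armed bandit problem.
We devise a novel intrinsic feedback signal for the bandit that captures the agent's ability to control the entropy in its environment.
We demonstrate that such agents can learn to control entropy, exhibit emergent behaviors in both high- and low-entropy regimes, and learn skillful behaviors in benchmark tasks.
Videos of the trained agents and a summary of the findings are available on the project page: https://sites.google.com/view/surprise-adaptive-agents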

Abstract (original)

Both entropy-minimizing and entropy-maximizing (curiosity) objectives for unsupervised reinforcement learning (RL) have been shown to be effective in different environments, depending on the environment’s level of natural entropy. However, neither method alone results in an agent that will consistently learn intelligent behavior across environments. In an effort to find a single entropy-based method that will encourage emergent behaviors in any environment, we propose an agent that can adapt its objective online, depending on the entropy conditions by framing the choice as a multi-armed bandit problem. We devise a novel intrinsic feedback signal for the bandit, which captures the agent’s ability to control the entropy in its environment. We demonstrate that such agents can learn to control entropy and exhibit emergent behaviors in both high- and low-entropy regimes and can learn skillful behaviors in benchmark tasks. Videos of the trained agents and summarized findings can be found on our project page https://sites.google.com/view/surprise-adaptive-agents
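
The abstract describes the objective choice as a multi-armed bandit problem but does not say which bandit algorithm is used or how the feedback signal is computed. The sketch below is only an illustration under stated assumptions: it uses UCB1 as the bandit, and a hypothetical feedback signal defined as the relative difference between the entropy the agent achieved and a baseline entropy. The names `UCB1Bandit`, `control_feedback`, `toy_episode_entropy`, and `run_toy_experiment`, and the toy entropy values, are all invented for this example and do not come from the paper.

```python
import math
import random

# Arm 0: train with an entropy-minimizing reward; arm 1: entropy-maximizing.
ARMS = ("minimize_entropy", "maximize_entropy")


class UCB1Bandit:
    """Standard UCB1 bandit (an assumed choice, not confirmed by the abstract)."""

    def __init__(self, n_arms, c=2.0):
        self.counts = [0] * n_arms    # number of times each arm was pulled
        self.values = [0.0] * n_arms  # running mean feedback per arm
        self.c = c                    # exploration coefficient

    def select(self):
        # Pull every arm once before using the confidence bound.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        scores = [
            v + math.sqrt(self.c * math.log(total) / n)
            for v, n in zip(self.values, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, feedback):
        # Incremental update of the running mean.
        self.counts[arm] += 1
        self.values[arm] += (feedback - self.values[arm]) / self.counts[arm]


def control_feedback(agent_entropy, baseline_entropy):
    """Hypothetical feedback: how far the agent moved entropy from a baseline."""
    return abs(agent_entropy - baseline_entropy) / max(abs(baseline_entropy), 1e-8)


def toy_episode_entropy(arm, rng):
    """Toy stand-in for an RL episode in a low-entropy environment.

    Assumed numbers: the baseline entropy is 1.0, minimizing can only lower
    it slightly, while maximizing can roughly double it.
    """
    target = 0.9 if arm == 0 else 2.0
    return target + rng.uniform(-0.05, 0.05)


def run_toy_experiment(n_episodes=200, seed=0, baseline_entropy=1.0):
    rng = random.Random(seed)
    bandit = UCB1Bandit(n_arms=len(ARMS))
    for _ in range(n_episodes):
        arm = bandit.select()                    # choose the objective
        entropy = toy_episode_entropy(arm, rng)  # run one episode with it
        bandit.update(arm, control_feedback(entropy, baseline_entropy))
    return bandit


if __name__ == "__main__":
    bandit = run_toy_experiment()
    for name, count in zip(ARMS, bandit.counts):
        print(name, count)
```

In this toy low-entropy setting the bandit ends up favoring the entropy-maximizing arm, because that is the objective under which the agent changes the entropy most. Swapping the toy targets so that minimizing has the larger effect would mimic a high-entropy environment and flip the preference.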

arXiv information

Authors	Adriana Hugessen, Roger Creus Castanyer, Faisal Mohamed, Glen Berseth
Published	2024-08-16 17:55:32+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
