Reward Guidance for Reinforcement Learning Tasks Based on Large Language Models: The LMGT Framework

Abstract

The inherent uncertainty in the environment transition model of Reinforcement Learning (RL) necessitates a delicate balance between exploration and exploitation. This balance is crucial for making efficient use of computational resources when estimating the agent's expected rewards. In scenarios with sparse rewards, such as robotic control systems, achieving this balance is particularly challenging. However, because extensive prior knowledge is available for many environments, learning from scratch in such contexts may be redundant. To address this issue, we propose Language Model Guided reward Tuning (LMGT), a novel, sample-efficient framework. LMGT leverages the comprehensive prior knowledge embedded in Large Language Models (LLMs) and their proficiency in processing non-standard data forms, such as wiki tutorials. By utilizing LLM-guided reward shifts, LMGT balances exploration and exploitation, thereby guiding the agent's exploratory behavior and enhancing sample efficiency. We have evaluated LMGT across various RL tasks as well as in the embodied robotic environment Housekeep. Our results demonstrate that LMGT consistently outperforms baseline methods. Furthermore, the findings suggest that our framework can substantially reduce the computational resources required during the RL training phase.
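
The idea of an LLM-guided reward shift can be illustrated with a minimal sketch. The abstract does not specify how the shift is computed or combined with the environment reward, so everything below is an assumption made for illustration: the additive form, the +1/0/-1 score values, the `scale` parameter, the tabular Q-learning update, and the names `toy_scorer`, `shifted_reward`, and `q_update` are all hypothetical. In particular, `toy_scorer` is a hard-coded stand-in for a real LLM call.

```python
from collections import defaultdict

def toy_scorer(state, action):
    """Hypothetical stand-in for an LLM judgment of a (state, action) pair.

    Assumed convention: +1 approve, -1 disapprove, 0 neutral.
    """
    if action == "toward_goal":
        return 1.0
    if action == "away_from_goal":
        return -1.0
    return 0.0

def shifted_reward(env_reward, state, action, scorer, scale=1.0):
    """Add the scorer's shift to the environment reward (assumed additive form)."""
    return env_reward + scale * scorer(state, action)

def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Standard one-step tabular Q-learning update."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

actions = ["toward_goal", "away_from_goal", "wait"]
q = defaultdict(float)

# Sparse-reward setting: the environment returns 0 for every action here,
# so only the shift distinguishes the actions after one update each.
for a in actions:
    r = shifted_reward(0.0, "s0", a, toy_scorer)
    q_update(q, "s0", a, r, "s1", actions)
```

In this toy setting the environment reward carries no signal, so the value estimates for the three actions differ only because of the shift. That is the intuition behind using prior knowledge to steer early exploration.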

arXiv information

Authors	Yongxin Deng, Xihe Qiu, Jue Chen, Xiaoyu Tan
Published	2025-05-02 09:58:28+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
