Zero-shot cross-modal transfer of Reinforcement Learning policies through a Global Workspace

Summary

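Humans perceive the world through multiple senses, which lets them build a comprehensive representation of their surroundings and generalize information across domains.
For example, given a textual description of a scene, humans can mentally visualize it.
In fields such as robotics and reinforcement learning (RL), agents can likewise access information about their environment through multiple sensors.
However, the redundancy and complementarity between sensors are difficult to exploit as a source of robustness (e.g. against sensor failure) or generalization (e.g. transfer across domains).
Prior research demonstrated that a robust and flexible multimodal representation can be constructed efficiently based on the cognitive-science notion of a "Global Workspace": a unique representation trained to combine information across modalities and to broadcast its signal back to each modality.
Here, we investigate whether such a brain-inspired multimodal representation could be advantageous for RL agents.
First, we train a "Global Workspace" to exploit information collected about the environment via two input modalities (a visual input, or an attribute vector representing the state of the agent and/or its environment).
Next, we train an RL agent policy using this frozen Global Workspace.
In two distinct environments and tasks, our results reveal the model's ability to perform zero-shot cross-modal transfer between input modalities, i.e. to apply a policy previously trained on attribute vectors to image inputs (and vice versa) without additional training or fine-tuning.
Variants and ablations of the full Global Workspace (including a CLIP-like multimodal representation trained via contrastive learning) did not display the same generalization abilities.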

Summary (original)

Humans perceive the world through multiple senses, enabling them to create a comprehensive representation of their surroundings and to generalize information across domains. For instance, when a textual description of a scene is given, humans can mentally visualize it. In fields like robotics and Reinforcement Learning (RL), agents can also access information about the environment through multiple sensors; yet redundancy and complementarity between sensors is difficult to exploit as a source of robustness (e.g. against sensor failure) or generalization (e.g. transfer across domains). Prior research demonstrated that a robust and flexible multimodal representation can be efficiently constructed based on the cognitive science notion of a ‘Global Workspace’: a unique representation trained to combine information across modalities, and to broadcast its signal back to each modality. Here, we explore whether such a brain-inspired multimodal representation could be advantageous for RL agents. First, we train a ‘Global Workspace’ to exploit information collected about the environment via two input modalities (a visual input, or an attribute vector representing the state of the agent and/or its environment). Then, we train a RL agent policy using this frozen Global Workspace. In two distinct environments and tasks, our results reveal the model’s ability to perform zero-shot cross-modal transfer between input modalities, i.e. to apply to image inputs a policy previously trained on attribute vectors (and vice-versa), without additional training or fine-tuning. Variants and ablations of the full Global Workspace (including a CLIP-like multimodal representation trained via contrastive learning) did not display the same generalization abilities.
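The two-stage recipe in the abstract (build a shared representation, freeze it, then train a policy on one modality and apply it to the other) can be illustrated with a deliberately tiny linear toy. This is only a sketch under strong assumptions, not the authors' model: the "image" is a fixed linear rendering of the attribute vector, the shared latent space is aligned by least squares on paired samples rather than by the paper's training procedure, and a supervised imitation of a hypothetical expert stands in for RL training. Every name below (`render`, `enc_attr`, `enc_img`, `expert_actions`, `act`) is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy world (assumption): an "image" is a fixed linear rendering of the attributes.
ATTR_DIM, IMG_DIM, LATENT_DIM, N_ACTIONS = 4, 16, 8, 2
render = rng.normal(size=(IMG_DIM, ATTR_DIM))

def sample_pairs(n):
    """Return n paired observations: (attribute vectors, rendered images)."""
    attrs = rng.normal(size=(n, ATTR_DIM))
    return attrs, attrs @ render.T

# Stage 1: build a shared latent space from paired data.
# The attribute encoder is a fixed random projection; the image encoder is
# fitted by least squares so both encoders agree on paired inputs.
# (Stand-in for training a Global Workspace; not the paper's objective.)
attrs, images = sample_pairs(512)
enc_attr = rng.normal(size=(ATTR_DIM, LATENT_DIM))
enc_img, *_ = np.linalg.lstsq(images, attrs @ enc_attr, rcond=None)

# Stage 2: freeze both encoders and fit a policy on ATTRIBUTE latents only.
# A supervised imitation of a hypothetical expert replaces RL training here.
expert = rng.normal(size=(ATTR_DIM, N_ACTIONS))

def expert_actions(a):
    return np.argmax(a @ expert, axis=1)

train_attrs, _ = sample_pairs(2000)
one_hot = np.eye(N_ACTIONS)[expert_actions(train_attrs)]
policy, *_ = np.linalg.lstsq(train_attrs @ enc_attr, one_hot, rcond=None)

def act(latents):
    """Greedy action from the frozen-latent linear policy."""
    return np.argmax(latents @ policy, axis=1)

# Stage 3: zero-shot transfer. The same policy is fed IMAGE latents,
# which it never saw during policy fitting.
test_attrs, test_images = sample_pairs(500)
truth = expert_actions(test_attrs)
acc_attr = np.mean(act(test_attrs @ enc_attr) == truth)
acc_img = np.mean(act(test_images @ enc_img) == truth)
print(f"attribute-input accuracy: {acc_attr:.3f}")
print(f"image-input accuracy (zero-shot): {acc_img:.3f}")
```

In this toy the transfer works only because the two encoders were fitted to produce matching latents for paired inputs; the policy itself is modality-agnostic. Real images are not linear functions of attributes, so the sketch says nothing about how hard the alignment step is in practice.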

arXiv information

Authors	Léopold Maytié, Benjamin Devillers, Alexandre Arnold, Rufin VanRullen
Published	2025-06-04 15:52:00+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
