A Simple Solution for Offline Imitation from Observations and Examples with Possibly Incomplete Trajectories

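Summary

Offline imitation from observations aims to solve MDPs in which only task-specific expert states and task-agnostic non-expert state-action pairs are available.
It is useful in real-world settings where arbitrary interaction is costly and expert actions are unavailable.
State-of-the-art DIstribution Correction Estimation (DICE) methods minimize the divergence in state occupancy between the expert and learner policies, then recover a policy by weighted behavior cloning.
However, because their optimization in the dual domain is not robust, their results become unstable when learning from incomplete trajectories.
To address this, the paper proposes Trajectory-Aware Imitation Learning from Observations (TAILO).
TAILO uses a discounted sum along the future trajectory as the weight for weighted behavior cloning.
Each term of the sum is scaled by the output of a discriminator trained to identify expert states.
Despite its simplicity, TAILO works well when the task-agnostic data contains trajectories or segments of expert behavior, an assumption commonly made in prior work.
Experiments across multiple testbeds show that TAILO is more robust and effective, particularly with incomplete trajectories.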

Abstract (original)

Offline imitation from observations aims to solve MDPs where only task-specific expert states and task-agnostic non-expert state-action pairs are available. Offline imitation is useful in real-world scenarios where arbitrary interactions are costly and expert actions are unavailable. The state-of-the-art ‘DIstribution Correction Estimation’ (DICE) methods minimize divergence of state occupancy between expert and learner policies and retrieve a policy with weighted behavior cloning; however, their results are unstable when learning from incomplete trajectories, due to a non-robust optimization in the dual domain. To address the issue, in this paper, we propose Trajectory-Aware Imitation Learning from Observations (TAILO). TAILO uses a discounted sum along the future trajectory as the weight for weighted behavior cloning. The terms for the sum are scaled by the output of a discriminator, which aims to identify expert states. Despite simplicity, TAILO works well if there exist trajectories or segments of expert behavior in the task-agnostic data, a common assumption in prior work. In experiments across multiple testbeds, we find TAILO to be more robust and effective, particularly with incomplete trajectories.
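The weighting idea in the abstract can be illustrated with a rough sketch. It is not the authors' implementation: the function names, the default discount factor, the normalization of the loss by the sum of weights, and the assumption that discriminator outputs are already available as one non-negative score per state are all choices made here for illustration. The sketch computes, for each step of a trajectory segment, a discounted sum of the scores of that state and the states that follow it, then uses those sums as weights in a behavior cloning loss.

```python
def discounted_future_weights(scores, gamma=0.98):
    """Return w[t] = sum over k >= 0 of gamma**k * scores[t + k].

    `scores` holds one hypothetical discriminator-derived value per state
    in a single trajectory segment. The sum stops at the end of the
    segment, so a truncated segment needs no special handling.
    """
    weights = [0.0] * len(scores)
    running = 0.0
    # Walk backward so each weight reuses the sum already computed for t + 1.
    for t in range(len(scores) - 1, -1, -1):
        running = scores[t] + gamma * running
        weights[t] = running
    return weights


def weighted_bc_loss(log_probs, weights):
    """Weighted behavior cloning loss for one segment.

    `log_probs[t]` is the learner's log-probability of the logged action
    at step t. Dividing by the weight sum is an assumption of this sketch.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / total


if __name__ == "__main__":
    # Made-up scores: the later states look more expert-like.
    w = discounted_future_weights([0.1, 0.2, 0.9, 0.8], gamma=0.5)
    print(w)
    print(weighted_bc_loss([-1.2, -0.7, -0.3, -0.4], w))
```

In this toy run the early states receive larger weights than their own scores would give them, because they lead to states the discriminator scores highly. That is the sense in which the weight depends on the future trajectory rather than on the current state alone.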

arXiv information

Authors	Kai Yan, Alexander G. Schwing, Yu-Xiong Wang
Published	2023-11-02 15:41:09+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
