Is Your Imitation Learning Policy Better than Mine? Policy Comparison with Near-Optimal Stopping

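Summary

Imitation learning has enabled robots to carry out complex, long-horizon tasks in challenging dexterous manipulation settings.
As new methods are developed, they need to be rigorously evaluated and compared against the corresponding baselines through repeated evaluation trials.
However, policy comparison is fundamentally constrained to small feasible sample sizes (such as 10 or 50) because of the substantial human effort involved and the limited inference throughput of the policies.
This paper proposes a new statistical framework for rigorously comparing two policies in the small-sample-size regime.
Earlier work on statistical policy comparison relies on batch testing, which requires a fixed, pre-determined number of trials and cannot adapt the sample size to the observed evaluation data.
In addition, extending the test with further trials risks inadvertent p-hacking, which weakens the statistical guarantees.
By contrast, the proposed statistical test is sequential, so researchers can decide whether to run more trials based on intermediate results.
This tailors the number of trials to the difficulty of the underlying comparison and saves considerable time and effort without sacrificing probabilistic correctness.
Extensive numerical simulations and real-world robot manipulation experiments show that the test achieves near-optimal stopping, letting researchers stop the evaluation and reach a decision in a near-minimal number of trials.
Specifically, it reduces the number of evaluation trials by up to 32% compared with state-of-the-art baselines while preserving the probabilistic correctness and statistical power of the comparison.
Moreover, the method is strongest on the most challenging comparison instances (those that require the most evaluation trials).
In a multi-task comparison scenario, it saves the evaluator more than 160 simulation rollouts.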

Summary (original)

Imitation learning has enabled robots to perform complex, long-horizon tasks in challenging dexterous manipulation settings. As new methods are developed, they must be rigorously evaluated and compared against corresponding baselines through repeated evaluation trials. However, policy comparison is fundamentally constrained by a small feasible sample size (e.g., 10 or 50) due to significant human effort and limited inference throughput of policies. This paper proposes a novel statistical framework for rigorously comparing two policies in the small sample size regime. Prior work in statistical policy comparison relies on batch testing, which requires a fixed, pre-determined number of trials and lacks flexibility in adapting the sample size to the observed evaluation data. Furthermore, extending the test with additional trials risks inducing inadvertent p-hacking, undermining statistical assurances. In contrast, our proposed statistical test is sequential, allowing researchers to decide whether or not to run more trials based on intermediate results. This adaptively tailors the number of trials to the difficulty of the underlying comparison, saving significant time and effort without sacrificing probabilistic correctness. Extensive numerical simulation and real-world robot manipulation experiments show that our test achieves near-optimal stopping, letting researchers stop evaluation and make a decision in a near-minimal number of trials. Specifically, it reduces the number of evaluation trials by up to 32% as compared to state-of-the-art baselines, while preserving the probabilistic correctness and statistical power of the comparison. Moreover, our method is strongest in the most challenging comparison instances (requiring the most evaluation trials); in a multi-task comparison scenario, we save the evaluator more than 160 simulation rollouts.
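
The p-hacking risk mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the paper's sequential test or any of its baselines. It assumes a pooled two-proportion z-test as a stand-in batch test, and the function names, the trial range (10 to 50 per policy), and the significance level of 0.05 are hypothetical choices made only for illustration. Both simulated policies share the same true success rate, so every rejection counted here is a false positive.

```python
import math
import random


def two_proportion_p_value(s_a, s_b, n):
    """Two-sided p-value of a pooled two-proportion z-test, n trials per policy."""
    pooled = (s_a + s_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return 1.0  # all trials agree, so there is no evidence of a difference
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    z = abs(s_a - s_b) / n / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(z)).
    return math.erfc(z / math.sqrt(2))


def false_positive_rates(p=0.5, n_min=10, n_max=50, alpha=0.05,
                         n_sims=2000, seed=0):
    """Return (fixed_rate, peeking_rate) when both policies succeed with prob. p.

    fixed_rate:   test once, after exactly n_max trials per policy.
    peeking_rate: re-run the same batch test after every trial from n_min
                  to n_max and declare a winner the first time p < alpha.
    """
    rng = random.Random(seed)
    fixed_hits = 0
    peek_hits = 0
    for _ in range(n_sims):
        s_a = s_b = 0
        rejected_at_any_peek = False
        for n in range(1, n_max + 1):
            s_a += rng.random() < p  # one more trial for policy A
            s_b += rng.random() < p  # one more trial for policy B
            if n >= n_min and two_proportion_p_value(s_a, s_b, n) < alpha:
                rejected_at_any_peek = True
        peek_hits += rejected_at_any_peek
        fixed_hits += two_proportion_p_value(s_a, s_b, n_max) < alpha
    return fixed_hits / n_sims, peek_hits / n_sims


if __name__ == "__main__":
    fixed, peeking = false_positive_rates()
    print(f"fixed sample size : {fixed:.3f}")
    print(f"peek after each trial: {peeking:.3f}")
```

Because the peeking rule also checks the final sample size, its false positive rate can never be lower than the fixed-size rate on the same simulated data, and in runs of this sketch it should come out noticeably higher. That gap is the loss of statistical assurance that a properly designed sequential test is meant to avoid.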

arXiv information

Authors	David Snyder, Asher James Hancock, Apurva Badithela, Emma Dixon, Patrick Miller, Rares Andrei Ambrus, Anirudha Majumdar, Masha Itkina, Haruki Nishimura
Published	2025-06-06 14:24:36+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
