Why Guided Dialog Policy Learning performs well? Understanding the role of adversarial learning and its alternative

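Summary

A dialog policy determines the system's action from the current state at each dialog turn, and it is crucial to the success of the dialog.
In recent years, reinforcement learning (RL) has emerged as a promising option for dialog policy learning (DPL).
In RL-based DPL, the dialog policy is updated according to rewards.
In multi-domain task-oriented dialog scenarios, which involve many combinations of state-action pairs, it is difficult to manually construct fine-grained rewards, such as state-action-based rewards, that guide the dialog policy effectively.
One way to estimate rewards from collected data is to train a reward estimator and the dialog policy simultaneously using adversarial learning (AL).
This approach has shown superior performance experimentally, but it suffers from problems inherent to AL, such as mode collapse.
This paper first identifies the role of AL in DPL through a detailed analysis of the objective functions of the dialog policy and the reward estimator.
Then, based on these analyses, it proposes a method that eliminates AL from reward estimation and DPL while retaining its advantages.
The method is evaluated on MultiWOZ, a multi-domain task-oriented dialog corpus.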

Abstract (original)

Dialog policies, which determine a system’s action based on the current state at each dialog turn, are crucial to the success of the dialog. In recent years, reinforcement learning (RL) has emerged as a promising option for dialog policy learning (DPL). In RL-based DPL, dialog policies are updated according to rewards. The manual construction of fine-grained rewards, such as state-action-based ones, to effectively guide the dialog policy is challenging in multi-domain task-oriented dialog scenarios with numerous state-action pair combinations. One way to estimate rewards from collected data is to train the reward estimator and dialog policy simultaneously using adversarial learning (AL). Although this method has demonstrated superior performance experimentally, it is fraught with the inherent problems of AL, such as mode collapse. This paper first identifies the role of AL in DPL through detailed analyses of the objective functions of dialog policy and reward estimator. Next, based on these analyses, we propose a method that eliminates AL from reward estimation and DPL while retaining its advantages. We evaluate our method using MultiWOZ, a multi-domain task-oriented dialog corpus.

arXiv information

Authors	Sho Shimoyama, Tetsuro Morimura, Kenshi Abe, Toda Takamichi, Yuta Tomomatsu, Masakazu Sugiyama, Asahi Hentona, Yuuki Azuma, Hirotaka Ninomiya
Published	2023-07-13 12:29:29+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
