Personalizing Task-oriented Dialog Systems via Zero-shot Generalizable Reward Function

Summary

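Task-oriented dialog systems let users accomplish tasks using natural language.
State-of-the-art systems respond to every user in the same way regardless of personality, although personalizing dialogues can lead to higher adoption and a better user experience.
Building personalized dialog systems is important yet challenging, and only a handful of works have taken on the challenge.
Most existing works rely on supervised learning and require laborious, expensive labeled training data for each user profile.
Moreover, collecting and labeling data for every user profile is virtually impossible.
In this work, we propose P-ToD, a novel framework for personalizing task-oriented dialog systems that adapts to a wide range of user profiles in an unsupervised fashion using a zero-shot generalizable reward function.
P-ToD uses a pre-trained GPT-2 as its backbone model and works in three phases.
Phase one performs task-specific training.
Phase two begins unsupervised personalization by leveraging the proximal policy optimization algorithm, which performs policy gradient updates guided by the zero-shot generalizable reward function.
Our novel reward function can quantify the quality of generated responses even for unseen profiles.
The optional final phase fine-tunes the personalized model using a few labeled training examples.
We conduct extensive experimental analysis on the personalized bAbI dialogue benchmark, covering five tasks and up to 180 diverse user profiles.
The experimental results show that P-ToD, even with access to zero labeled examples, outperforms state-of-the-art supervised personalization models and achieves competitive performance on BLEU and ROUGE metrics compared with a strong fully supervised GPT-2 baseline.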

Summary (original)

Task-oriented dialog systems enable users to accomplish tasks using natural language. State-of-the-art systems respond to users in the same way regardless of their personalities, although personalizing dialogues can lead to higher levels of adoption and better user experiences. Building personalized dialog systems is an important, yet challenging endeavor and only a handful of works took on the challenge. Most existing works rely on supervised learning approaches and require laborious and expensive labeled training data for each user profile. Additionally, collecting and labeling data for each user profile is virtually impossible. In this work, we propose a novel framework, P-ToD, to personalize task-oriented dialog systems capable of adapting to a wide range of user profiles in an unsupervised fashion using a zero-shot generalizable reward function. P-ToD uses a pre-trained GPT-2 as a backbone model and works in three phases. Phase one performs task-specific training. Phase two kicks off unsupervised personalization by leveraging the proximal policy optimization algorithm that performs policy gradients guided by the zero-shot generalizable reward function. Our novel reward function can quantify the quality of the generated responses even for unseen profiles. The optional final phase fine-tunes the personalized model using a few labeled training examples. We conduct extensive experimental analysis using the personalized bAbI dialogue benchmark for five tasks and up to 180 diverse user profiles. The experimental results demonstrate that P-ToD, even when it had access to zero labeled examples, outperforms state-of-the-art supervised personalization models and achieves competitive performance on BLEU and ROUGE metrics when compared to a strong fully-supervised GPT-2 baseline.
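
As a rough illustration of the phase-two idea (policy-gradient updates scored by a reward function), the following Python sketch computes the standard PPO clipped surrogate objective for a batch of generated responses. It is not the authors' implementation: the abstract does not give the reward function, how advantages are derived from rewards, or any hyperparameters, so the function name `ppo_clipped_objective`, the `clip_eps=0.2` default, and the example numbers are all assumptions made for illustration.

```python
import math


def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective over a batch (to be maximized).

    logp_new / logp_old: log-probabilities of each sampled response under the
    current policy and the policy that generated it.
    advantages: one score per response; here assumed to come from a reward
    function minus some baseline (hypothetical, not specified in the abstract).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio between the updated and the sampling policy.
        ratio = math.exp(lp_new - lp_old)
        # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


if __name__ == "__main__":
    # Made-up numbers: two responses, one scored above and one below baseline.
    print(ppo_clipped_objective([-1.0, -2.0], [-1.1, -1.9], [0.5, -0.3]))
```

The clipping limits how far a single update can move the policy away from the one that produced the samples, which is the property that makes PPO a common choice for fine-tuning a language model against a learned or hand-built reward.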

arXiv information

Authors	A. B. Siddique, M. H. Maqbool, Kshitija Taywade, Hassan Foroosh
Published	2023-03-24 04:33:40+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
