DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI

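Abstract

Despite advances in conversational AI, language models still struggle to handle diverse conversational tasks, and existing collections of dialogue datasets often lack diversity and comprehensiveness.
To address these problems, we introduce DialogStudio, the largest and most diverse collection of dialogue datasets, unified under a consistent format while preserving the original information of each dataset.
The collection covers open-domain dialogue, task-oriented dialogue, natural language understanding, conversational recommendation, dialogue summarization, and knowledge-grounded dialogue, making it a rich and diverse resource for dialogue research and model training.
To further increase the utility of DialogStudio, we identify the license of each dataset and design domain-aware prompts for selected dialogues to facilitate instruction-aware fine-tuning.
In addition, we develop conversational AI models using the dataset collection, and experiments in both zero-shot and few-shot learning scenarios demonstrate the superiority of DialogStudio.
To improve transparency and to support dataset-based and task-based research as well as language model pre-training, all datasets, licenses, code, and models associated with DialogStudio are publicly available at https://github.com/salesforce/DialogStudio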
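
The abstract does not spell out what "unified under a consistent format while preserving the original information" or "domain-aware prompts" look like in practice. The following is only a minimal sketch of the idea, not the DialogStudio schema: the class names (`Turn`, `UnifiedDialog`), the field names, the raw record layout, and the prompt templates are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Turn:
    """One exchange: a user utterance and the system reply (may be empty)."""
    user: str
    system: str


@dataclass
class UnifiedDialog:
    """Hypothetical unified record; NOT the actual DialogStudio schema."""
    dialog_id: str
    domain: str
    turns: List[Turn]
    # The untouched source record, kept so no original information is lost.
    original: Dict[str, Any] = field(default_factory=dict)
    prompt: str = ""


def unify(raw: Dict[str, Any], domain: str) -> UnifiedDialog:
    """Convert an assumed raw record with alternating user/system utterances."""
    utterances = raw["utterances"]
    turns = []
    for i in range(0, len(utterances), 2):
        # A trailing user utterance without a reply gets an empty system side.
        reply = utterances[i + 1] if i + 1 < len(utterances) else ""
        turns.append(Turn(user=utterances[i], system=reply))
    return UnifiedDialog(dialog_id=raw["id"], domain=domain, turns=turns, original=raw)


def add_domain_prompt(dialog: UnifiedDialog, templates: Dict[str, str]) -> UnifiedDialog:
    """Attach a prompt chosen by domain, falling back to a default template."""
    dialog.prompt = templates.get(dialog.domain, templates["default"])
    return dialog


def to_training_text(dialog: UnifiedDialog) -> str:
    """Flatten a dialogue into prompt-prefixed text for instruction tuning."""
    lines = [dialog.prompt] if dialog.prompt else []
    for turn in dialog.turns:
        lines.append(f"User: {turn.user}")
        if turn.system:
            lines.append(f"System: {turn.system}")
    return "\n".join(lines)
```

Under these assumptions, a raw record from any source would pass through `unify` once, receive a prompt from `add_domain_prompt`, and be serialized with `to_training_text`. Keeping the source record in `original` is one simple way to make the conversion lossless.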

Abstract (original)

Despite advancements in conversational AI, language models encounter challenges to handle diverse conversational tasks, and existing dialogue dataset collections often lack diversity and comprehensiveness. To tackle these issues, we introduce DialogStudio: the largest and most diverse collection of dialogue datasets, unified under a consistent format while preserving their original information. Our collection encompasses data from open-domain dialogues, task-oriented dialogues, natural language understanding, conversational recommendation, dialogue summarization, and knowledge-grounded dialogues, making it an incredibly rich and diverse resource for dialogue research and model training. To further enhance the utility of DialogStudio, we identify the licenses for each dataset and design domain-aware prompts for selected dialogues to facilitate instruction-aware fine-tuning. Furthermore, we develop conversational AI models using the dataset collection, and our experiments in both zero-shot and few-shot learning scenarios demonstrate the superiority of DialogStudio. To improve transparency and support dataset and task-based research, as well as language model pre-training, all datasets, licenses, codes, and models associated with DialogStudio are made publicly accessible at https://github.com/salesforce/DialogStudio

arXiv information

Authors	Jianguo Zhang, Kun Qian, Zhiwei Liu, Shelby Heinecke, Rui Meng, Ye Liu, Zhou Yu, Huan Wang, Silvio Savarese, Caiming Xiong
Published	2023-07-20 17:59:35+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
