Better Synthetic Data by Retrieving and Transforming Existing Datasets

Abstract

Despite recent advances in large language models, building dependable and deployable NLP models typically requires abundant, high-quality training data. However, task-specific data is not available for many use cases, and manually curating task-specific data is labor-intensive. Recent work has studied prompt-driven synthetic data generation using large language models, but these generated datasets tend to lack complexity and diversity. To address these limitations, we introduce a method, DataTune, to make better use of existing, publicly available datasets to improve automatic dataset generation. DataTune performs dataset transformation, enabling the repurposing of publicly available datasets into a format that is directly aligned with the specific requirements of target tasks. On a diverse set of language-based tasks from the BIG-Bench benchmark, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49% and improves over existing methods that use synthetic or retrieved training data by 34%. We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks. We integrate DataTune into an open-source repository to make this method accessible to the community: https://github.com/neulab/prompt2model.

arXiv information

Authors	Saumya Gandhi, Ritu Gala, Vijay Viswanathan, Tongshuang Wu, Graham Neubig
Published	2024-04-22 17:15:32+00:00

Source and services used

arxiv.jp, Google
