AugGPT: Leveraging ChatGPT for Text Data Augmentation

Summary

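Text data augmentation is an effective strategy for overcoming limited sample sizes in many natural language processing (NLP) tasks.
The problem is most acute in few-shot learning, where target-domain data is typically much scarcer and of lower quality.
A natural and widely used remedy is data augmentation, which increases the sample size and helps models better capture invariances in the data.
However, current text data augmentation methods either cannot guarantee that generated data is labeled correctly (lacking faithfulness), cannot guarantee that it is sufficiently diverse (lacking compactness), or both.
Inspired by the recent success of large language models, and in particular ChatGPT, which has demonstrated improved language comprehension, this work proposes AugGPT, a text data augmentation approach based on ChatGPT.
AugGPT rephrases each sentence in the training samples into multiple samples that are conceptually similar but semantically different.
The augmented samples can then be used to train downstream models.
Experiments on few-shot text classification tasks show that AugGPT outperforms state-of-the-art text data augmentation methods in both test accuracy and the distribution of the augmented samples.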
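
The augmentation step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `augment_dataset`, `toy_paraphraser`, and `n_variants` are hypothetical, the paraphraser is a template-based stand-in for a ChatGPT call, and the deduplication step is an added convenience that the abstract does not mention. Each paraphrase inherits the label of the sentence it came from, which is the property the abstract calls faithfulness.

```python
from typing import Callable, List, Tuple

# A labeled example: (text, label). Hypothetical type alias for this sketch.
Example = Tuple[str, str]


def augment_dataset(
    samples: List[Example],
    paraphrase: Callable[[str, int], List[str]],
    n_variants: int = 3,
) -> List[Example]:
    """Return the original samples plus label-preserving paraphrases.

    `paraphrase(text, n)` is assumed to return up to `n` rewordings of `text`.
    In an AugGPT-style setup it would wrap a call to a large language model.
    """
    augmented: List[Example] = []
    for text, label in samples:
        augmented.append((text, label))
        seen = {text}
        for candidate in paraphrase(text, n_variants):
            candidate = candidate.strip()
            # Skip empty or repeated outputs (an assumption of this sketch).
            if candidate and candidate not in seen:
                seen.add(candidate)
                augmented.append((candidate, label))
    return augmented


def toy_paraphraser(text: str, n: int) -> List[str]:
    """Template-based stand-in for a language model; illustration only."""
    templates = ["In other words, {}", "Put differently, {}", "That is to say, {}"]
    return [template.format(text) for template in templates[:n]]


if __name__ == "__main__":
    few_shot = [("the movie was great", "positive"), ("the plot was dull", "negative")]
    for text, label in augment_dataset(few_shot, toy_paraphraser, n_variants=2):
        print(label, "|", text)
```

Passing the paraphraser in as a function keeps the loop independent of any particular model or API, so the stand-in can be replaced with a real model client without touching the augmentation logic. The returned list can then be fed to whatever downstream classifier is being trained.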

Abstract (original)

Text data augmentation is an effective strategy for overcoming the challenge of limited sample sizes in many natural language processing (NLP) tasks. This challenge is especially prominent in the few-shot learning scenario, where the data in the target domain is generally much scarcer and of lowered quality. A natural and widely-used strategy to mitigate such challenges is to perform data augmentation to better capture the data invariance and increase the sample size. However, current text data augmentation methods either can’t ensure the correct labeling of the generated data (lacking faithfulness) or can’t ensure sufficient diversity in the generated data (lacking compactness), or both. Inspired by the recent success of large language models, especially the development of ChatGPT, which demonstrated improved language comprehension abilities, in this work, we propose a text data augmentation approach based on ChatGPT (named AugGPT). AugGPT rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples. The augmented samples can then be used in downstream model training. Experiment results on few-shot learning text classification tasks show the superior performance of the proposed AugGPT approach over state-of-the-art text data augmentation methods in terms of testing accuracy and distribution of the augmented samples.

arXiv information

Authors	Haixing Dai, Zhengliang Liu, Wenxiong Liao, Xiaoke Huang, Yihan Cao, Zihao Wu, Lin Zhao, Shaochen Xu, Wei Liu, Ninghao Liu, Sheng Li, Dajiang Zhu, Hongmin Cai, Lichao Sun, Quanzheng Li, Dinggang Shen, Tianming Liu, Xiang Li
Published	2023-03-20 11:39:47+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
