SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation

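Summary

Large language models (LLMs) are versatile and can handle many tasks, but for computational efficiency it is often desirable to distill their capabilities into smaller student models.
For classification tasks, one way to do this is dataset synthesis, in which the LLM generates examples for each label.
Earlier synthesis approaches used few-shot prompting, which depends on the LLM's parametric knowledge to produce usable examples.
This leads to repetition, bias toward popular entities, and stylistic differences from human-written text.
This work proposes Synthesize by Retrieval and Refinement (SynthesizRR), which uses retrieval augmentation to introduce variety into the dataset synthesis process: because the retrieved passages differ, the LLM is "seeded" with different content when it generates each example.
The authors empirically study the synthesis of six datasets that require complex synthesis strategies, covering topic classification, sentiment analysis, tone detection, and humor.
Compared with standard 32-shot prompting and six baseline approaches, they find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance.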
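
The seeding idea in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the word-overlap retriever, the prompt wording, and the names `tokenize`, `retrieve`, and `build_prompts` are all hypothetical, and the LLM call itself is left out. The only point shown is that each prompt is built around a different retrieved passage, so the generated examples start from different content.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, corpus, k):
    """Toy retriever (an assumption): rank passages by word overlap with the query."""
    query_counts = Counter(tokenize(query))

    def overlap(passage):
        # Multiset intersection counts the tokens shared by query and passage.
        return sum((query_counts & Counter(tokenize(passage))).values())

    # sorted() is stable, so tied passages keep their corpus order.
    return sorted(corpus, key=overlap, reverse=True)[:k]


def build_prompts(label, seed_query, corpus, k=2):
    """Build one synthesis prompt per retrieved passage (hypothetical prompt text)."""
    return [
        f"Passage: {passage}\nRewrite the passage as a short '{label}' example:"
        for passage in retrieve(seed_query, corpus, k)
    ]


corpus = [
    "The central bank raised interest rates to slow inflation.",
    "The striker scored twice in the cup final.",
    "Bond markets fell after the central bank announcement on rates.",
]
prompts = build_prompts("business", "central bank rates", corpus, k=2)
for prompt in prompts:
    print(prompt, end="\n\n")
```

Each prompt would then be sent to an LLM of choice. A plain few-shot setup, by contrast, would reuse the same prompt for every example of a label.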

Summary (original)

Large language models (LLMs) are versatile and can address many tasks, but for computational efficiency, it is often desirable to distill their capabilities into smaller student models. One way to do this for classification tasks is via dataset synthesis, which can be accomplished by generating examples of each label from the LLM. Prior approaches to synthesis use few-shot prompting, which relies on the LLM’s parametric knowledge to generate usable examples. However, this leads to issues of repetition, bias towards popular entities, and stylistic differences from human text. In this work, we propose Synthesize by Retrieval and Refinement (SynthesizRR), which uses retrieval augmentation to introduce variety into the dataset synthesis process: as retrieved passages vary, the LLM is ‘seeded’ with different content to generate its examples. We empirically study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor, requiring complex synthesis strategies. We find SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance, when compared to standard 32-shot prompting and six baseline approaches.
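
The abstract does not say how lexical diversity is measured. As a rough illustration only, the sketch below computes a distinct-n ratio, one simple proxy for repetition in a synthesized dataset. The function name `distinct_n` and the whitespace tokenization are assumptions, not the paper's evaluation code.

```python
def distinct_n(texts, n=2):
    """Fraction of n-grams across all texts that are unique (higher means less repetition)."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        # Collect every run of n consecutive tokens in this text.
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


repetitive = ["the movie was great", "the movie was great"]
varied = ["the movie was great", "a dull plot ruined it"]

print(distinct_n(repetitive))  # 0.5: every bigram appears twice
print(distinct_n(varied))      # 1.0: no bigram is repeated
```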

arXiv information

Authors	Abhishek Divekar, Greg Durrett
Published	2024-05-16 12:22:41+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
