Data Quality Enhancement on the Basis of Diversity with Large Language Models for Text Classification: Uncovered, Difficult, and Noisy

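Summary

In recent years, the use of large language models (LLMs) for text classification has attracted widespread attention.
Even so, the classification accuracy of LLMs has not yet surpassed that of smaller models across the board.
Fine-tuning can improve LLM performance on text classification.
However, existing LLM-based data quality research is difficult to apply directly to text classification problems.
To further improve LLM performance on classification tasks, this paper proposes a data quality enhancement (DQE) method for LLM-based text classification.
The method first uses a greedy algorithm to select data, splitting the dataset into a sampled subset and an unsampled subset, and then fine-tunes the LLM on the sampled data.
The fine-tuned model is then used to predict the outcomes for the unsampled data, and the incorrectly predicted examples are categorized as uncovered, difficult, or noisy.
Experimental results show that the method effectively improves LLM performance on text classification tasks and significantly improves training efficiency, cutting training time by nearly half.
The method achieves state-of-the-art performance on several open-source classification tasks.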
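
The selection-and-split step can be illustrated with a small sketch. The abstract does not say which greedy algorithm the paper uses, so the code below swaps in a common diversity-oriented choice, farthest-point (k-center greedy) selection over precomputed text embeddings. The function name `greedy_diverse_split`, the Euclidean distance, and the toy vectors are all assumptions made for illustration, not the authors' implementation. The later step of labeling mispredicted examples as uncovered, difficult, or noisy is not sketched because its criteria are not given here.

```python
import math


def greedy_diverse_split(vectors, k):
    """Hypothetical helper: split indices into sampled and unsampled subsets.

    Uses farthest-point (k-center greedy) selection, an assumed stand-in
    for the greedy algorithm mentioned in the abstract.
    """
    n = len(vectors)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and len(vectors)")

    # Seed with the first item (an arbitrary choice for this sketch).
    sampled = [0]
    # Distance from each point to its nearest already-sampled point.
    min_dist = [math.dist(v, vectors[0]) for v in vectors]
    min_dist[0] = -1.0  # mark as taken so it is never picked again

    while len(sampled) < k:
        # Greedy step: take the point farthest from the current sample.
        nxt = max(range(n), key=min_dist.__getitem__)
        sampled.append(nxt)
        for i, v in enumerate(vectors):
            if min_dist[i] >= 0:
                min_dist[i] = min(min_dist[i], math.dist(v, vectors[nxt]))
        min_dist[nxt] = -1.0

    chosen = set(sampled)
    unsampled = [i for i in range(n) if i not in chosen]
    return sampled, unsampled


# Toy 2-D "embeddings": two tight pairs and one isolated point.
toy_vectors = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.0, 0.1), (5.0, 5.0)]
sampled_idx, unsampled_idx = greedy_diverse_split(toy_vectors, k=3)
print(sampled_idx, unsampled_idx)
```

In this sketch the sampled indices would feed the fine-tuning run, and the unsampled indices would be the examples that the fine-tuned model later predicts on.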

Abstract (original)

In recent years, the use of large language models (LLMs) for text classification has attracted widespread attention. Despite this, the classification accuracy of LLMs has not yet universally surpassed that of smaller models. LLMs can enhance their performance in text classification through fine-tuning. However, existing data quality research based on LLMs is challenging to apply directly to solve text classification problems. To further improve the performance of LLMs in classification tasks, this paper proposes a data quality enhancement (DQE) method for text classification based on LLMs. This method starts by using a greedy algorithm to select data, dividing the dataset into sampled and unsampled subsets, and then performing fine-tuning of the LLMs using the sampled data. Subsequently, this model is used to predict the outcomes for the unsampled data, categorizing incorrectly predicted data into uncovered, difficult, and noisy data. Experimental results demonstrate that our method effectively enhances the performance of LLMs in text classification tasks and significantly improves training efficiency, saving nearly half of the training time. Our method has achieved state-of-the-art performance in several open-source classification tasks.

arXiv information

Authors	Min Zeng, Caiquan Liu, Shiqi Zhang, Li Xie, Chen Sang, Xiaoxin Chen
Published	2024-12-09 15:28:39+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
