Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance

Summary

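Imbalanced classification and spurious correlation are common challenges in data science and machine learning.
Both problems stem from data imbalance: certain groups of samples are heavily underrepresented, which can compromise the accuracy, robustness, and generalizability of the learned models.
Recent work proposes leveraging the flexibility and generative capabilities of large language models (LLMs), typically built on transformer architectures, to generate synthetic samples and augment the observed data.
In the imbalanced-data setting, LLMs are used to oversample underrepresented groups, with promising improvements.
However, theoretical understanding of such synthetic-data approaches is clearly lacking.
This article develops new theoretical foundations for systematically studying the role of synthetic samples in addressing imbalanced classification and spurious correlation.
Specifically, it first explicitly quantifies the benefits of synthetic oversampling.
Next, it analyzes the scaling dynamics of synthetic data augmentation and derives the corresponding scaling law.
Finally, it demonstrates the capacity of transformer models to generate high-quality synthetic samples.
In addition, extensive numerical experiments validate the efficacy of LLM-based synthetic oversampling and augmentation.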
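
To make the idea of synthetic oversampling concrete, here is a minimal sketch of the general pattern: count the samples per class, then add generated samples to each underrepresented class until it matches the largest one. This is an illustration under stated assumptions, not the paper's method. The names `oversample_minority`, `generate`, and `jitter` are hypothetical, and the Gaussian-noise generator is only a stand-in for an LLM-based generator.

```python
import random
from collections import Counter


def oversample_minority(samples, labels, generate, seed=0):
    """Return a class-balanced copy of (samples, labels).

    `generate(seed_sample, rng)` is a hypothetical callback that returns one
    synthetic sample derived from an observed one. In an LLM-based pipeline it
    would wrap a call to a language model; here it is left abstract.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class

    new_samples, new_labels = list(samples), list(labels)
    for label, count in counts.items():
        # Observed samples of this class, used as seeds for generation.
        pool = [s for s, y in zip(samples, labels) if y == label]
        for _ in range(target - count):
            new_samples.append(generate(rng.choice(pool), rng))
            new_labels.append(label)
    return new_samples, new_labels


def jitter(sample, rng, scale=0.1):
    # Placeholder generator (assumption): add small Gaussian noise per feature.
    return [x + rng.gauss(0.0, scale) for x in sample]


# Toy data: four samples of class 0 and a single sample of class 1.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [5.0, 5.1]]
y = [0, 0, 0, 0, 1]

X_bal, y_bal = oversample_minority(X, y, jitter)
print(Counter(y_bal))
```

After the call, both classes contain four samples, and the original five samples are kept unchanged at the front of `X_bal`. Passing the generator as a callback keeps the balancing logic independent of how the synthetic samples are produced.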

Summary (original)

Imbalanced classification and spurious correlation are common challenges in data science and machine learning. Both issues are linked to data imbalance, with certain groups of data samples significantly underrepresented, which in turn would compromise the accuracy, robustness and generalizability of the learned models. Recent advances have proposed leveraging the flexibility and generative capabilities of large language models (LLMs), typically built on transformer architectures, to generate synthetic samples and to augment the observed data. In the context of imbalanced data, LLMs are used to oversample underrepresented groups and have shown promising improvements. However, there is a clear lack of theoretical understanding of such synthetic data approaches. In this article, we develop novel theoretical foundations to systematically study the roles of synthetic samples in addressing imbalanced classification and spurious correlation. Specifically, we first explicitly quantify the benefits of synthetic oversampling. Next, we analyze the scaling dynamics in synthetic data augmentation, and derive the corresponding scaling law. Finally, we demonstrate the capacity of transformer models to generate high-quality synthetic samples. We further conduct extensive numerical experiments to validate the efficacy of the LLM-based synthetic oversampling and augmentation.

arXiv information

Authors	Ryumei Nakada, Yichen Xu, Lexin Li, Linjun Zhang
Published	2025-01-06 15:37:48+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
