From Macro to Micro: Probing Dataset Diversity in Language Model Fine-Tuning

Abstract

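Dataset diversity plays a crucial role in the successful training of many machine learning models, particularly in the supervised fine-tuning (SFT) stage of large language model (LLM) development.
Despite growing recognition of its importance, systematic analysis of dataset diversity remains lacking.
To address this gap, this work presents a systematic taxonomy of existing diversity-control strategies, which focus primarily on the instruction component and operate at either the macroscopic level (the semantics of the entire instruction) or the mesoscopic level (instruction units). It also introduces a new analysis of microscopic diversity within the response component, specifically the statistical distribution of tokens in SFT training samples.
In the experimental evaluation, fixed-size datasets (e.g., 10,000 samples each) are constructed from a corpus of 117,000 open-source SFT samples, incorporating six distinct diversity-control strategies that span the macroscopic, mesoscopic, and microscopic levels and are applied to both instructions and responses.
LLMs are then fine-tuned on these datasets to evaluate the six diversity-control strategies.
The results show that macroscopic and mesoscopic strategies improve performance as diversity increases, while the microscopic strategy on responses exhibits both a stronger correlation between model performance and the degree of diversity and the best performance at maximum diversity among all strategies.
These findings offer practical insights for constructing high-performance SFT datasets.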

Abstract (original)

Dataset diversity plays a pivotal role for the successful training of many machine learning models, particularly in the supervised fine-tuning (SFT) stage of large language model (LLM) development. Despite increasing recognition of its importance, systematic analyses of dataset diversity still remain underexplored. To address this gap, this work presents a systematic taxonomy of existing diversity-control strategies, which primarily focus on the instruction component, operating at either macroscopic (entire instruction semantics) or mesoscopic levels (instruction units), and furthermore introduces a novel analysis of microscopic diversity within the response component, specifically analyzing the statistical distribution of tokens in SFT training samples. In the experimental evaluation, we construct fixed-size datasets (e.g., 10,000 samples each) from a corpus of 117,000 open-source SFT samples, incorporating six distinct diversity-control strategies spanning macro-, meso-, and microscopic levels applied to both instructions and responses. We then fine-tune LLMs on these datasets to assess the six diversity-control strategies. Results reveal that while macroscopic and mesoscopic strategies lead to higher performance with increasing diversity, the microscopic strategy in responses exhibits both a stronger correlation between model performance and the degree of diversity and superior performance with maximum diversity across all strategies. These findings offer actionable insights for constructing high-performance SFT datasets.
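As a rough illustration of what "microscopic" diversity over response tokens could look like in practice, the sketch below scores each response by the Shannon entropy of its token distribution and keeps a fixed-size subset with the highest scores. This is only a minimal sketch under stated assumptions: the abstract does not specify which statistic of the token distribution the authors use, so the entropy measure, the whitespace tokenizer, the `"response"` field name, and the function names `token_entropy` and `select_by_response_entropy` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def token_entropy(tokens):
    """Shannon entropy (in bits) of the empirical token distribution.

    Assumption: entropy stands in for the paper's token-level statistic,
    which the abstract does not name.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_by_response_entropy(samples, k, tokenize=str.split):
    """Return the k samples whose responses have the highest token entropy.

    Assumptions: each sample is a dict with a "response" string, and
    whitespace splitting is an acceptable stand-in for a real tokenizer.
    """
    ranked = sorted(
        samples,
        key=lambda s: token_entropy(tokenize(s["response"])),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    corpus = [
        {"instruction": "Say yes three times.", "response": "yes yes yes"},
        {"instruction": "Describe a cat.", "response": "the cat sat down"},
    ]
    subset = select_by_response_entropy(corpus, k=1)
    print(subset[0]["instruction"])
```

In a setting like the one described, `k` would be the fixed dataset size (for example 10,000) and `tokenize` would be the tokenizer of the model being fine-tuned; varying the selection rule (highest, lowest, or random entropy) would give datasets with different degrees of diversity to compare.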

arXiv information

Authors	Haoyu Li,Xuhong Li,Yiming Dong,Kun Liu
Published	2025-05-30 16:31:05+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
