Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric

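Summary

Data diversity is crucial for instruction tuning of large language models.
Existing work has explored a range of diversity-aware data selection methods for building high-quality datasets and improving model performance.
However, the fundamental problem of precisely defining and measuring data diversity remains underexplored, which limits clear guidance for data engineering.
To address this, we systematically analyze 11 existing diversity measurement methods by evaluating how well each correlates with model performance across extensive fine-tuning experiments.
Our results indicate that a reliable diversity measure must properly account for both the differences between samples and the information density of the sample space.
Building on this, we propose NovelSum, a new diversity metric based on sample-level "novelty".
Experiments on both simulated and real-world data show that NovelSum accurately captures variations in diversity and achieves a 0.97 correlation with instruction-tuned model performance, underscoring its value in guiding data engineering practice.
Using NovelSum as an optimization objective, we further develop a greedy, diversity-oriented data selection strategy that outperforms existing approaches, validating both the effectiveness and the practical significance of the metric.
The code is available at https://github.com/UmeanNever/NovelSum.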

Abstract (original)

Data diversity is crucial for the instruction tuning of large language models. Existing studies have explored various diversity-aware data selection methods to construct high-quality datasets and enhance model performance. However, the fundamental problem of precisely defining and measuring data diversity remains underexplored, limiting clear guidance for data engineering. To address this, we systematically analyze 11 existing diversity measurement methods by evaluating their correlation with model performance through extensive fine-tuning experiments. Our results indicate that a reliable diversity measure should properly account for both inter-sample differences and the information density in the sample space. Building on this, we propose NovelSum, a new diversity metric based on sample-level ‘novelty.’ Experiments on both simulated and real-world data show that NovelSum accurately captures diversity variations and achieves a 0.97 correlation with instruction-tuned model performance, highlighting its value in guiding data engineering practices. With NovelSum as an optimization objective, we further develop a greedy, diversity-oriented data selection strategy that outperforms existing approaches, validating both the effectiveness and practical significance of our metric. The code is available at https://github.com/UmeanNever/NovelSum.
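The abstract names two ingredients for a reliable diversity measure (inter-sample differences and information density) and a greedy selection strategy built on the metric, but it does not give the formula. The sketch below is only an illustration of how those ideas could fit together, not the definition used in the paper or its repository. Everything in it is an assumption made for the example: the function names, the radius-based density count, the 1/rank proximity weight, and the Euclidean distance on toy 2-D points.

```python
import math


def local_density(point, pool, radius):
    """Assumed density proxy: number of pool points within `radius` of `point`."""
    return sum(1 for p in pool if math.dist(point, p) <= radius)


def novelty(x, others, pool, radius=1.0):
    """Hypothetical novelty of `x` relative to `others`.

    Each distance is weighted by 1/rank (nearer neighbours count more) and by
    the local density around the neighbour (an assumed stand-in for
    "information density in the sample space").
    """
    ranked = sorted(others, key=lambda o: math.dist(x, o))
    return sum(
        local_density(o, pool, radius) * math.dist(x, o) / rank
        for rank, o in enumerate(ranked, start=1)
    )


def diversity_score(subset, pool, radius=1.0):
    """Sum of per-sample novelty, each sample scored against the rest of the subset."""
    return sum(
        novelty(x, subset[:i] + subset[i + 1:], pool, radius)
        for i, x in enumerate(subset)
    )


def greedy_select(pool, k, radius=1.0):
    """Greedy selection: repeatedly add the candidate with the highest novelty
    relative to what has already been selected. The seed is simply pool[0]."""
    selected = [pool[0]]
    remaining = list(pool[1:])
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda c: novelty(c, selected, pool, radius))
        selected.append(best)
        remaining.remove(best)
    return selected
```

In practice the points would be embeddings of instruction samples rather than 2-D coordinates, and the distance and density estimates would need to be chosen for that embedding space; the repository linked in the abstract is the place to check how the authors actually define them.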

arXiv information

Authors	Yuming Yang, Yang Nan, Junjie Ye, Shihan Dou, Xiao Wang, Shuo Li, Huijie Lv, Mingqi Wu, Tao Gui, Qi Zhang, Xuanjing Huang
Published	2025-06-02 15:41:05+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
