SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models

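Summary

Although data selection has proven effective for large language models (LLMs) during pretraining and instruction fine-tuning, improving data efficiency in supervised fine-tuning (SFT) for specialized domains remains a significant challenge because of the complexity of fine-tuning data.
To bridge this gap, we introduce SmallToLarge (S2L), an effective and scalable data selection method for SFT that leverages training trajectories from small models to guide data selection for larger models.
Through extensive experiments, we show that S2L significantly improves data efficiency in SFT for mathematical problem-solving: it reduces the training data to just 11% of the original MathInstruct dataset (Yue et al., 2023) while matching full-dataset performance, and it outperforms state-of-the-art data selection algorithms by an average of 4.7% across 6 in-domain and out-of-domain evaluation datasets.
Remarkably, with only 50K examples selected for SFT, S2L achieves 32.7% accuracy on the most challenging benchmark, MATH (Hendrycks et al., 2021), improving Phi-2 (Li et al., 2023b) by 16.6%.
In clinical text summarization on the MIMIC-III dataset (Johnson et al., 2016), S2L again outperforms training on the full dataset while using only 50% of the data.
Notably, S2L can perform data selection with a reference model 40x smaller than the target model, proportionally reducing the cost of data selection.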

Abstract (original)

Despite the effectiveness of data selection for large language models (LLMs) during pretraining and instruction fine-tuning phases, improving data efficiency in supervised fine-tuning (SFT) for specialized domains poses significant challenges due to the complexity of fine-tuning data. To bridge this gap, we introduce an effective and scalable data selection method for SFT, SmallToLarge (S2L), which leverages training trajectories from small models to guide the data selection for larger models. We demonstrate through extensive experiments that S2L significantly improves data efficiency in SFT for mathematical problem-solving, reducing the training data to just 11% of the original MathInstruct dataset (Yue et al., 2023) to match full dataset performance while outperforming state-of-the-art data selection algorithms by an average of 4.7% across 6 in- and out-domain evaluation datasets. Remarkably, selecting only 50K data for SFT, S2L achieves a 32.7% accuracy on the most challenging MATH (Hendrycks et al., 2021) benchmark, improving Phi-2 (Li et al., 2023b) by 16.6%. In clinical text summarization on the MIMIC-III dataset (Johnson et al., 2016), S2L again outperforms training on the full dataset using only 50% of the data. Notably, S2L can perform data selection using a reference model 40x smaller than the target model, proportionally reducing the cost of data selection.
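The abstract says only that S2L uses training trajectories from a small model to guide selection for a larger one; it does not give the procedure. The following is a rough illustrative sketch, not the authors' implementation. It assumes that a trajectory is the per-example training loss recorded at a few checkpoints of the small model, that trajectories are grouped with plain k-means, and that the selection budget is spread as evenly as possible across the groups. The function names `kmeans` and `select_balanced` are hypothetical, and the toy trajectories are synthetic.

```python
import random
from typing import Dict, List, Sequence


def sq_dist(p: Sequence[float], q: Sequence[float]) -> float:
    """Squared Euclidean distance between two trajectories."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points: List[List[float]], k: int, iters: int = 20, seed: int = 0) -> List[int]:
    """Plain k-means (assumed clustering step); returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each trajectory to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Move each center to the mean of its members (empty clusters keep their center).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def select_balanced(labels: Sequence[int], budget: int, seed: int = 0) -> List[int]:
    """Pick `budget` example indices, spreading the budget evenly over clusters.

    Clusters are visited from smallest to largest so that any budget a small
    cluster cannot fill is passed on to the larger ones.
    """
    rng = random.Random(seed)
    clusters: Dict[int, List[int]] = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    ordered = sorted(clusters.values(), key=len)
    selected: List[int] = []
    remaining = budget
    for i, members in enumerate(ordered):
        share = remaining // (len(ordered) - i)
        take = min(len(members), share)
        selected.extend(rng.sample(members, take))
        remaining -= take
    return sorted(selected)


# Synthetic loss trajectories over 5 checkpoints of a hypothetical small model.
data_rng = random.Random(0)
trajectories: List[List[float]] = []
for _ in range(30):  # loss drops quickly
    trajectories.append([2.0 * 0.5 ** t + data_rng.gauss(0, 0.01) for t in range(5)])
for _ in range(10):  # loss drops slowly
    trajectories.append([2.0 - 0.1 * t + data_rng.gauss(0, 0.01) for t in range(5)])
for _ in range(5):   # loss stays flat
    trajectories.append([2.0 + data_rng.gauss(0, 0.01) for _ in range(5)])

labels = kmeans(trajectories, k=3)
selected = select_balanced(labels, budget=12)
print(len(selected), "examples selected for fine-tuning the larger model")
```

In this sketch the selected indices would then be used to subset the fine-tuning data for the larger target model. Visiting clusters from smallest to largest is a design choice that keeps rare trajectory patterns represented instead of letting the largest cluster dominate the subset.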

arXiv information

Authors	Yu Yang,Siddhartha Mishra,Jeffrey N Chiang,Baharan Mirzasoleiman
Published	2024-12-05 18:47:47+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
