WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training

Abstract

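Post-training of language models (LLMs), from DPO to distillation, can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its early stages.
One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data generating models and LLM judges.
To close this gap, we introduce WildChat-50M, the largest public chat dataset to date.
We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters.
We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating Re-Wild, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples.
The dataset, samples, and code are available at https://github.com/penfever/wildchat-50m.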

Abstract (original)

Language model (LLM) post-training, from DPO to distillation, can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data generating models and LLM judges. To close this gap, we introduce WILDCHAT-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating RE-WILD, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples. Our dataset, samples and code are available at https://github.com/penfever/wildchat-50m.

arXiv information

Authors	Benjamin Feuer, Chinmay Hegde
Published	2025-01-30 17:21:44+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
