BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline

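Summary

The general capabilities of large language models (LLMs) depend heavily on the composition and selection of extensive pretraining datasets, which several institutions treat as commercial secrets.
To mitigate this issue, we open-source the details of a universally applicable data processing pipeline and validate its effectiveness and potential by introducing a competitive LLM baseline.
Specifically, the data processing pipeline consists of broad collection to scale up the data and reweighting to improve its quality.
We then pretrain a 7B model, BaichuanSEED, on 3T tokens processed by our pipeline without any deliberate downstream task-related optimization, followed by a simple but effective supervised fine-tuning stage.
BaichuanSEED demonstrates consistency and predictability throughout training and achieves performance on comprehensive benchmarks comparable to that of several advanced commercial LLMs, such as Qwen1.5 and Llama3.
We also conduct several heuristic experiments to discuss the potential for further optimization on downstream tasks such as mathematics and coding.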
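
The summary names two pipeline stages, collection at scale and reweighting for quality, and the paper title also mentions deduplication, but none of these steps is specified here. As a rough illustration only, the sketch below assumes exact-match deduplication on normalized text and reweighting by per-source sampling weights. The function names, the `text` and `source` fields, and the weight values are all hypothetical and are not taken from the paper.

```python
import hashlib
import random


def deduplicate(docs):
    """Keep the first document for each whitespace-normalized, lowercased text.

    Assumption: exact-match deduplication; the paper's actual method is not
    described in this summary.
    """
    seen = set()
    unique = []
    for doc in docs:
        normalized = " ".join(doc["text"].lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def reweight(docs, source_weights, n_samples, seed=0):
    """Sample documents with probability proportional to their source weight.

    Assumption: reweighting is modeled as weighted sampling with replacement;
    sources missing from `source_weights` default to a weight of 1.0.
    """
    rng = random.Random(seed)
    weights = [source_weights.get(doc["source"], 1.0) for doc in docs]
    return rng.choices(docs, weights=weights, k=n_samples)


if __name__ == "__main__":
    # Hypothetical toy corpus; the first two entries differ only in case and spacing.
    corpus = [
        {"text": "Hello  world", "source": "web"},
        {"text": "hello world", "source": "web"},
        {"text": "def f(x): return x + 1", "source": "code"},
    ]
    unique_docs = deduplicate(corpus)
    sample = reweight(unique_docs, {"web": 1.0, "code": 3.0}, n_samples=4)
    print(len(unique_docs), [doc["source"] for doc in sample])
```

Sampling with replacement keeps the sketch short; a real pipeline would more likely apply near-duplicate detection and quality-based weights, which this example does not attempt.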

Summary (original)

The general capabilities of Large Language Models (LLM) highly rely on the composition and selection on extensive pretraining datasets, treated as commercial secrets by several institutions. To mitigate this issue, we open-source the details of a universally applicable data processing pipeline and validate its effectiveness and potential by introducing a competitive LLM baseline. Specifically, the data processing pipeline consists of broad collection to scale up and reweighting to improve quality. We then pretrain a 7B model BaichuanSEED with 3T tokens processed by our pipeline without any deliberate downstream task-related optimization, followed by an easy but effective supervised fine-tuning stage. BaichuanSEED demonstrates consistency and predictability throughout training and achieves comparable performance on comprehensive benchmarks with several commercial advanced large language models, such as Qwen1.5 and Llama3. We also conduct several heuristic experiments to discuss the potential for further optimization of downstream tasks, such as mathematics and coding.

arXiv information

Authors	Guosheng Dong, Da Pan, Yiding Sun, Shusen Zhang, Zheng Liang, Xin Wu, Yanjun Shen, Fan Yang, Haoze Sun, Tianpeng Li, Mingan Lin, Jianhua Xu, Yufan Zhang, Xiaonan Nie, Lei Su, Bingning Wang, Wentao Zhang, Jiaxin Mao, Zenan Zhou, Weipeng Chen
Published	2024-08-27 14:08:23+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
