The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

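Summary

Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has drawn scrutiny because of possible intellectual property infringement and ethical concerns.
Training LLMs on openly licensed text is a first step toward addressing these issues, but prior data collection efforts have yielded datasets too small or too low in quality to produce performant LLMs.
To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining.
The Common Pile comprises content from 30 sources spanning diverse domains, including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more.
Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 trillion and 2 trillion tokens respectively.
Both models attain performance competitive with LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B.
In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation, as well as the training mixture and checkpoints for the Comma v0.1 models.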

Summary (original)

Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.

arXiv information

Authors	Nikhil Kandpal,Brian Lester,Colin Raffel,Sebastian Majstorovic,Stella Biderman,Baber Abbasi,Luca Soldaini,Enrico Shippole,A. Feder Cooper,Aviya Skowron,John Kirchenbauer,Shayne Longpre,Lintang Sutawika,Alon Albalak,Zhenlin Xu,Guilherme Penedo,Loubna Ben Allal,Elie Bakouch,John David Pressman,Honglu Fan,Dashiell Stander,Guangyu Song,Aaron Gokaslan,Tom Goldstein,Brian R. Bartoldson,Bhavya Kailkhura,Tyler Murray
Published	2025-06-05 16:21:30+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
