DataComp: In search of the next generation of multimodal datasets

Summary

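Multimodal datasets are a key component of recent advances such as Stable Diffusion and GPT-4, yet their design has not received as much research attention as model architectures or training algorithms.
To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered on a new candidate pool of 12.8 billion image-text pairs from Common Crawl.
Benchmark participants design new filtering techniques or curate new data sources, then evaluate the new dataset by running the standardized CLIP training code and testing the resulting model on 38 downstream test sets.
Our benchmark consists of multiple compute scales spanning four orders of magnitude, which makes it possible to study scaling trends and keeps the benchmark accessible to researchers with varying resources.
Our baseline experiments show that the DataComp workflow leads to better training sets.
In particular, our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute.
DataComp and all accompanying code are released at www.datacomp.ai.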

Summary (original)

Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. In particular, our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI’s CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DataComp and all accompanying code at www.datacomp.ai.

arXiv information

Authors	Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexander Ratner, Shuran Song, Hannaneh Hajishirzi, Ali Farhadi, Romain Beaumont, Sewoong Oh, Alex Dimakis, Jenia Jitsev, Yair Carmon, Vaishaal Shankar, Ludwig Schmidt
Published	2023-10-20 17:01:44+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
