Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text

Summary

Title: Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text

Summary:

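– In-context vision and language models such as Flamingo accept arbitrarily interleaved sequences of images and text as input.
– This format enables few-shot learning by interleaving independent supervised (image, text) examples, and it also allows more complex prompts that involve interaction between images.
– To support this interface, pretraining is performed on web corpora that similarly contain interleaved images and text.
– Until now, however, large-scale data in this form has not been publicly available.
– The authors release Multimodal C4 (mmc4), an augmentation of the popular text-only c4 corpus with images interleaved.
– Images are placed into longer bodies of text by a linear assignment algorithm over CLIP features, which the authors show outperforms alternative placement methods.
– mmc4 covers everyday topics such as cooking, travel, and technology.
– Manual inspection of a random sample of documents shows that the vast majority of images (90%) are topically relevant, and that linear assignment frequently (78%) selects individual sentences that are specifically well aligned with each image.
– After filtering out NSFW images, ads, and similar content, the corpus contains 103M documents with 585M images interleaved with 43B English tokens.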
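
To make the image placement step more concrete, here is a minimal sketch of linear assignment over image-sentence similarities. It is not the authors' implementation. The toy 2-d vectors stand in for CLIP features, the function names `cosine` and `assign_images` are hypothetical, and the constraint that each sentence receives at most one image is an assumption of this sketch. The optimum is found by exhaustive search, which is only practical for tiny inputs; a real pipeline would more likely use a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment`.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def assign_images(image_vecs, sentence_vecs):
    """Assign each image to a distinct sentence, maximizing total similarity.

    Returns (assignment, total), where assignment[i] is the index of the
    sentence chosen for image i. Exhaustive search: tiny inputs only.
    """
    n_img, n_sent = len(image_vecs), len(sentence_vecs)
    if n_img > n_sent:
        # Assumption of this sketch: no more images than sentences.
        raise ValueError("this sketch assumes no more images than sentences")
    sim = [[cosine(i, s) for s in sentence_vecs] for i in image_vecs]
    best, best_score = None, -math.inf
    # Every ordered choice of n_img distinct sentences is one candidate.
    for perm in permutations(range(n_sent), n_img):
        score = sum(sim[i][j] for i, j in enumerate(perm))
        if score > best_score:
            best, best_score = list(perm), score
    return best, best_score


# Toy stand-ins for CLIP image and sentence features (hypothetical values).
images = [[1.0, 0.0], [0.9, 0.1]]
sentences = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
assignment, total = assign_images(images, sentences)
print(assignment)
```

In this toy case both images are individually most similar to sentence 0, so a per-image argmax would stack them on the same sentence. The assignment formulation instead spreads them out, giving sentence 0 to the first image and sentence 2 to the second, because that pairing has the highest total similarity.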

Summary (original)

In-context vision and language models like Flamingo support arbitrarily interleaved sequences of images and text as input. This format not only enables few-shot learning via interleaving independent supervised (image, text) examples, but also, more complex prompts involving interaction between images, e.g., ‘What do image A and image B have in common?’ To support this interface, pretraining occurs over web corpora that similarly contain interleaved images+text. To date, however, large-scale data of this form have not been publicly available. We release Multimodal C4 (mmc4), an augmentation of the popular text-only c4 corpus with images interleaved. We use a linear assignment algorithm to place images into longer bodies of text using CLIP features, a process that we show outperforms alternatives. mmc4 spans everyday topics like cooking, travel, technology, etc. A manual inspection of a random sample of documents shows that a vast majority (90%) of images are topically relevant, and that linear assignment frequently selects individual sentences specifically well-aligned with each image (78%). After filtering NSFW images, ads, etc., the corpus contains 103M documents containing 585M images interleaved with 43B English tokens.

arXiv information

Authors	Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, Yejin Choi
Published	2023-04-14 06:17:46+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, OpenAI
