What If We Recaption Billions of Web Images with LLaMA-3?

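Summary

Web-crawled image-text pairs are inherently noisy.
Prior work has shown that semantically aligning and enriching the textual descriptions of these pairs can significantly improve model training across a range of vision-language tasks, particularly text-to-image generation.
However, large-scale investigations in this area remain predominantly closed-source.
Our paper aims to bridge this community effort by leveraging the powerful, open-sourced LLaMA-3, a GPT-4 level LLM.
Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5, and then use it to recaption 1.3 billion images from the DataComp-1B dataset.
Our empirical results confirm that the resulting dataset, Recap-DataComp-1B, offers substantial benefits for training advanced vision-language models.
For discriminative models such as CLIP, we observe improved zero-shot performance on cross-modal retrieval tasks.
For generative models such as text-to-image Diffusion Transformers, the generated images align significantly better with users' text instructions, especially when following complex queries.
Our project page is https://www.haqtu.me/Recap-Datacomp-1B/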

Summary (original)

Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and \textit{open-sourced} LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users’ text instructions, especially in following complex queries. Our project page is https://www.haqtu.me/Recap-Datacomp-1B/

arXiv information

Authors	Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, Yuyin Zhou, Cihang Xie
Published	2024-06-12 17:59:07+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
