2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

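Summary

Compared with image-text pair data, interleaved corpora let vision-language models (VLMs) understand the world more naturally, as humans do. However, existing datasets of this kind are crawled from web pages and suffer from low knowledge density, loose image-text relations, and poor logical coherence between images. Meanwhile, the internet hosts a vast number of instructional videos (for example, online geometry courses) that people widely use to learn foundational subjects, yet these valuable resources remain underused in VLM training. This paper introduces a high-quality textbook corpus with richer foundational knowledge for VLM pretraining. The corpus gathers more than 2.5 years of instructional video, 22,000 class hours in total. The authors first collect instructional videos systematically using an LLM-proposed taxonomy. They then progressively extract and refine visual (keyframes), audio (ASR), and textual (OCR) knowledge from the videos and organize it into an image-text interleaved corpus based on temporal order. Compared with its counterparts, the video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate strong pretraining performance for VLMs, particularly on knowledge- and reasoning-intensive tasks such as ScienceQA and MathVista. In addition, VLMs pretrained on the textbook show strong awareness of interleaved context, using the visual and textual cues in their few-shot context to solve tasks.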

Summary (original)

Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally like humans. However, such existing datasets are crawled from webpages, facing challenges like low knowledge density, loose image-text relations, and poor logical coherence between images. On the other hand, the internet hosts vast instructional videos (e.g., online geometry courses) that are widely used by humans to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality multimodal textbook corpus with richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. Then we progressively extract and refine visual (keyframes), audio (ASR), and textual knowledge (OCR) from the videos, and organize them as an image-text interleaved corpus based on temporal order. Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its superb pretraining performance, particularly in knowledge- and reasoning-intensive tasks like ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook exhibit outstanding interleaved context awareness, leveraging visual and textual cues in their few-shot context for task solving. (Our code is available at https://github.com/DAMO-NLP-SG/multimodal_textbook.)
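
The abstract's last pipeline step, organizing keyframes and ASR/OCR text into an interleaved sequence by temporal order, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Keyframe` and `TextSegment` types, the `interleave` and `render` helpers, the `<image:...>` placeholder format, and the sample data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Keyframe:
    """A frame kept from the video (hypothetical type)."""
    time_s: float
    image_path: str


@dataclass
class TextSegment:
    """A piece of ASR or OCR text with its timestamp (hypothetical type)."""
    time_s: float
    source: str  # "asr" or "ocr"
    text: str


Item = Union[Keyframe, TextSegment]


def interleave(keyframes: List[Keyframe], segments: List[TextSegment]) -> List[Item]:
    """Merge images and text into one sequence ordered by timestamp.

    sorted() is stable, so at equal timestamps a keyframe stays ahead of
    the text, because keyframes are placed first in the combined list.
    """
    items: List[Item] = [*keyframes, *segments]
    return sorted(items, key=lambda item: item.time_s)


def render(items: List[Item]) -> str:
    """Flatten the sequence into text with image placeholders (assumed format)."""
    parts = []
    for item in items:
        if isinstance(item, Keyframe):
            parts.append(f"<image:{item.image_path}>")
        else:
            parts.append(item.text)
    return " ".join(parts)


# Made-up sample data for a short geometry clip.
keyframes = [Keyframe(0.0, "f0.jpg"), Keyframe(12.0, "f1.jpg")]
segments = [
    TextSegment(1.5, "asr", "Consider a right triangle."),
    TextSegment(12.0, "ocr", "a^2 + b^2 = c^2"),
    TextSegment(14.0, "asr", "This is the Pythagorean theorem."),
]

sample = render(interleave(keyframes, segments))
print(sample)
```

Sorting on a single timestamp key keeps each image next to the narration and on-screen text that occurred around it, which is the property the abstract attributes to temporal ordering.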

arXiv information

Authors	Wenqi Zhang, Hang Zhang, Xin Li, Jiashuo Sun, Yongliang Shen, Weiming Lu, Deli Zhao, Yueting Zhuang, Lidong Bing
Published	2025-01-03 13:25:27+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
