Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Abstract

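Today's most advanced vision-language models (VLMs) remain proprietary.
The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to reach good performance, effectively distilling those closed VLMs into open ones.
As a result, the community has lacked foundational knowledge about how to build performant VLMs from scratch.
We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness.
Our key contribution is PixMo, a collection of new datasets that includes highly detailed image captions for pre-training, a free-form image Q&A dataset for fine-tuning, and an innovative 2D pointing dataset, all collected without the use of external VLMs.
The success of our approach depends on careful modeling choices, a well-tuned training pipeline, and, most critically, the quality of the newly collected datasets.
Our best-in-class 72B model not only outperforms other models in the class of open-weight and open-data models, but also outperforms larger proprietary models including Claude 3.5 Sonnet and Gemini 1.5 Pro and Flash, ranking second only to GPT-4o on both academic benchmarks and a large human evaluation.
Model weights, new datasets, and source code are available at https://molmo.allenai.org/blog.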

Abstract (original)

Today’s most advanced vision-language models (VLMs) remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed VLMs into open ones. As a result, the community has been missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key contribution is a collection of new datasets called PixMo, including a dataset of highly detailed image captions for pre-training, a free-form image Q&A dataset for fine-tuning, and an innovative 2D pointing dataset, all collected without the use of external VLMs. The success of our approach relies on careful modeling choices, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets. Our best-in-class 72B model not only outperforms others in the class of open weight and data models, but also outperforms larger proprietary models including Claude 3.5 Sonnet, and Gemini 1.5 Pro and Flash, second only to GPT-4o based on both academic benchmarks and on a large human evaluation. Our model weights, new datasets, and source code are available at https://molmo.allenai.org/blog.

arXiv information

Authors	Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi
Published	2024-12-05 14:28:40+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
