PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

Summary

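Vision-language models are integral to computer vision research, yet many high-performing models remain closed, obscuring their data, design, and training recipes.
The research community has responded by distilling from black-box models to label training data, achieving strong benchmark results at the cost of measurable scientific progress.
However, without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure.
In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding.
We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding.
To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions.
In addition, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks that focuses on the ability to reason about the "what", "where", "when", and "how" of a video.
We make our work fully reproducible by providing data, training recipes, code, and models.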

Summary (original)

Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about ‘what’, ‘where’, ‘when’, and ‘how’ of a video. We make our work fully reproducible by providing data, training recipes, code & models.

arXiv information

Authors	Jang Hyun Cho,Andrea Madotto,Effrosyni Mavroudi,Triantafyllos Afouras,Tushar Nagarajan,Muhammad Maaz,Yale Song,Tengyu Ma,Shuming Hu,Suyog Jain,Miguel Martin,Huiyu Wang,Hanoona Rasheed,Peize Sun,Po-Yao Huang,Daniel Bolya,Nikhila Ravi,Shashank Jain,Tammy Stark,Shane Moon,Babak Damavandi,Vivian Lee,Andrew Westbury,Salman Khan,Philipp Krähenbühl,Piotr Dollár,Lorenzo Torresani,Kristen Grauman,Christoph Feichtenhofer
Published	2025-04-17 17:59:56+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
