Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models

Abstract

Object-centric (OC) representations, which model visual scenes as compositions of discrete objects, have the potential to be used in various downstream tasks to achieve systematic compositional generalization and facilitate reasoning. However, these claims have yet to be thoroughly validated empirically. Recently, foundation models have demonstrated unparalleled capabilities across diverse domains, from language to computer vision, positioning them as a potential cornerstone of future research for a wide range of computational tasks. In this paper, we conduct an extensive empirical study on representation learning for downstream Visual Question Answering (VQA), which requires an accurate compositional understanding of the scene. We thoroughly investigate the benefits and trade-offs of OC models and alternative approaches including large pre-trained foundation models on both synthetic and real-world data, ultimately identifying a promising path to leverage the strengths of both paradigms. The extensiveness of our study, encompassing over 600 downstream VQA models and 15 different types of upstream representations, also provides several additional insights that we believe will be of interest to the community at large.

arXiv information

Authors	Amir Mohammad Karimi Mamaghan, Samuele Papa, Karl Henrik Johansson, Stefan Bauer, Andrea Dittadi
Published	2025-03-03 11:48:03+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
