Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models

Abstract

Object-centric (OC) representations, which model visual scenes as compositions of discrete objects, have the potential to be used in various downstream tasks to achieve systematic compositional generalization and facilitate reasoning. However, these claims have yet to be thoroughly validated empirically. Recently, foundation models have demonstrated unparalleled capabilities across diverse domains, from language to computer vision, positioning them as a potential cornerstone of future research for a wide range of computational tasks. In this paper, we conduct an extensive empirical study on representation learning for downstream Visual Question Answering (VQA), which requires an accurate compositional understanding of the scene. We thoroughly investigate the benefits and trade-offs of OC models and alternative approaches including large pre-trained foundation models on both synthetic and real-world data, ultimately identifying a promising path to leverage the strengths of both paradigms. The extensiveness of our study, encompassing over 600 downstream VQA models and 15 different types of upstream representations, also provides several additional insights that we believe will be of interest to the community at large.

arXiv information

Authors	Amir Mohammad Karimi Mamaghan, Samuele Papa, Karl Henrik Johansson, Stefan Bauer, Andrea Dittadi
Published	2025-02-28 17:32:26+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
