Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining?

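Summary

The multimedia community has shown great interest in using multimodal pretrained neural network models to perceive and represent the physical world, and vision-language pretraining (VLP) is currently the most captivating topic among them.
However, few efforts have been dedicated to exploring 1) whether essential linguistic knowledge (e.g., semantics and syntax) can be extracted during VLP, and 2) how such linguistic knowledge affects or enhances multimodal alignment.
In response, we aim to elucidate the impact of comprehensive linguistic knowledge, including semantic expression and syntactic structure, on multimodal alignment.
Specifically, we design and release SNARE, the first large-scale multimodal alignment probing benchmark, which detects vital linguistic components such as lexical, semantic, and syntactic knowledge and contains four tasks: semantic structure, negation logic, attribute ownership, and relationship composition.
Based on our proposed probing benchmark, a holistic analysis of five advanced VLP models shows that the VLP model:
i) is insensitive to complex syntactic structures and relies on content words for sentence comprehension;
ii) has limited comprehension of combinations between sentences and negations;
iii) faces challenges in determining the presence of actions or spatial relationships within visual information and struggles to verify the correctness of triple combinations.
The benchmark and code are available at https://github.com/WangFei-2019/SNARE/.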

Summary (original)

The multimedia community has shown significant interest in perceiving and representing the physical world with multimodal pretrained neural network models, and among them, vision-language pretraining (VLP) is currently the most captivating topic. However, there have been few endeavors dedicated to exploring 1) whether essential linguistic knowledge (e.g., semantics and syntax) can be extracted during VLP, and 2) how such linguistic knowledge impacts or enhances multimodal alignment. In response, here we aim to elucidate the impact of comprehensive linguistic knowledge, including semantic expression and syntactic structure, on multimodal alignment. Specifically, we design and release SNARE, the first large-scale multimodal alignment probing benchmark, to detect the vital linguistic components, e.g., lexical, semantic, and syntactic knowledge, containing four tasks: Semantic structure, Negation logic, Attribute ownership, and Relationship composition. Based on our proposed probing benchmark, our holistic analyses of five advanced VLP models illustrate that the VLP model: i) shows insensitivity towards complex syntax structures and relies on content words for sentence comprehension; ii) demonstrates limited comprehension of combinations between sentences and negations; iii) faces challenges in determining the presence of actions or spatial relationships within visual information and struggles with verifying the correctness of triple combinations. We make our benchmark and code available at https://github.com/WangFei-2019/SNARE/.

arXiv information

Authors	Fei Wang, Liang Ding, Jun Rao, Ye Liu, Li Shen, Changxing Ding
Published	2023-08-24 16:17:40+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
