With Ears to See and Eyes to Hear: Sound Symbolism Experiments with Multimodal Large Language Models

Abstract

Recently, Large Language Models (LLMs) and Vision Language Models (VLMs) have demonstrated aptitude as potential substitutes for human participants in experiments testing psycholinguistic phenomena. However, an understudied question is to what extent models that only have access to vision and text modalities are able to implicitly understand sound-based phenomena via abstract reasoning from orthography and imagery alone. To investigate this, we analyse the ability of VLMs and LLMs to demonstrate sound symbolism (i.e., to recognise a non-arbitrary link between sounds and concepts) as well as their ability to ‘hear’ via the interplay of the language and vision modules of open and closed-source multimodal models. We perform multiple experiments, including replicating the classic Kiki-Bouba and Mil-Mal shape and magnitude symbolism tasks, and comparing human judgements of linguistic iconicity with those of LLMs. Our results show that VLMs demonstrate varying levels of agreement with human labels, and that VLMs may require more task information than their human counterparts for in silico experimentation. We additionally see through higher maximum agreement levels that Magnitude Symbolism is an easier pattern for VLMs to identify than Shape Symbolism, and that an understanding of linguistic iconicity is highly dependent on model size.

arXiv information

Authors	Tyler Loakman, Yucheng Li, Chenghua Lin
Published	2024-10-18 15:42:06+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
