Distinct social-linguistic processing between humans and large audio-language models: Evidence from model-brain alignment

Summary

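Voice-based AI development faces a distinctive challenge: it must process both linguistic and paralinguistic information.
This study compares how large audio-language models (LALMs) and humans integrate speaker characteristics during speech comprehension, and asks whether LALMs process speaker-contextualized language in ways that parallel human cognitive mechanisms.
The processing patterns of two LALMs (Qwen2-Audio and Ultravox 0.5) were compared with human EEG responses.
Using surprisal and entropy metrics from the models, the authors analyzed sensitivity to speaker-content incongruency across social stereotype violations (e.g., a man claiming to regularly get manicures) and biological knowledge violations (e.g., a man claiming to be pregnant).
Qwen2-Audio showed increased surprisal for speaker-incongruent content, and its surprisal values significantly predicted human N400 responses, whereas Ultravox 0.5 showed limited sensitivity to speaker characteristics.
Importantly, neither model reproduced the human-like processing distinction between social violations (which elicit N400 effects) and biological violations (which elicit P600 effects).
These findings reveal both the potential and the limitations of current LALMs in processing speaker-contextualized language, and suggest that social-linguistic processing mechanisms differ between humans and LALMs.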
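
The surprisal and entropy metrics mentioned above have standard information-theoretic definitions, which the following minimal Python sketch illustrates. It is not the authors' pipeline: the abstract does not describe how probabilities were extracted from the models, and the two next-word distributions below are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import math


def surprisal(probs, token):
    """Surprisal in bits of the observed token: -log2 p(token)."""
    return -math.log2(probs[token])


def entropy(probs):
    """Shannon entropy in bits of the whole next-word distribution."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


# Hypothetical next-word distributions (assumed values, not data from the paper)
# for the same sentence frame heard in two different voices.
congruent_voice = {"manicures": 0.5, "haircuts": 0.25, "checkups": 0.25}
incongruent_voice = {"manicures": 0.125, "haircuts": 0.625, "checkups": 0.25}

for label, dist in [("congruent", congruent_voice),
                    ("incongruent", incongruent_voice)]:
    s = surprisal(dist, "manicures")
    h = entropy(dist)
    print(f"{label}: surprisal={s:.2f} bits, entropy={h:.2f} bits")
```

In this toy setup the word "manicures" is less probable under the incongruent voice, so its surprisal is higher. Surprisal describes the word that was actually heard, while entropy describes the model's uncertainty before hearing it.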

Abstract (original)

Voice-based AI development faces unique challenges in processing both linguistic and paralinguistic information. This study compares how large audio-language models (LALMs) and humans integrate speaker characteristics during speech comprehension, asking whether LALMs process speaker-contextualized language in ways that parallel human cognitive mechanisms. We compared two LALMs’ (Qwen2-Audio and Ultravox 0.5) processing patterns with human EEG responses. Using surprisal and entropy metrics from the models, we analyzed their sensitivity to speaker-content incongruency across social stereotype violations (e.g., a man claiming to regularly get manicures) and biological knowledge violations (e.g., a man claiming to be pregnant). Results revealed that Qwen2-Audio exhibited increased surprisal for speaker-incongruent content and its surprisal values significantly predicted human N400 responses, while Ultravox 0.5 showed limited sensitivity to speaker characteristics. Importantly, neither model replicated the human-like processing distinction between social violations (eliciting N400 effects) and biological violations (eliciting P600 effects). These findings reveal both the potential and limitations of current LALMs in processing speaker-contextualized language, and suggest differences in social-linguistic processing mechanisms between humans and LALMs.

arXiv information

Authors	Hanlin Wu, Xufeng Duan, Zhenguang Cai
Published	2025-03-25 12:10:47+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
