Do Large Language Models Judge Error Severity Like Humans?

Summary

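Large language models (LLMs) are increasingly used as automatic evaluators for natural language generation, but it remains unclear whether they can accurately reproduce human judgments of error severity.
This study systematically compares human and LLM ratings of image descriptions that contain controlled semantic errors.
It extends the experimental framework of van Miltenburg et al. (2020) to both unimodal (text-only) and multimodal (text + image) settings and evaluates four error types: age, gender, clothing type, and clothing colour.
The findings show that humans assign different levels of severity to different error types, and that visual context significantly amplifies the perceived severity of colour and type errors.
Notably, most LLMs assign low scores to gender errors but disproportionately high scores to colour errors, unlike humans, who judge both as highly severe but for different reasons.
This suggests that these models may have internalised the social norms that influence gender judgments but lack the perceptual grounding needed to emulate human sensitivity to colour, which is shaped by distinct neural mechanisms.
Only one of the evaluated LLMs, Doubao, reproduces the human-like ranking of error severity, but it does not distinguish between error types as clearly as humans do.
Surprisingly, DeepSeek-V3, a unimodal LLM, achieves the highest alignment with human judgments in both the unimodal and multimodal conditions, outperforming even state-of-the-art multimodal models.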

Summary (original)

Large Language Models (LLMs) are increasingly used as automated evaluators in natural language generation, yet it remains unclear whether they can accurately replicate human judgments of error severity. In this study, we systematically compare human and LLM assessments of image descriptions containing controlled semantic errors. We extend the experimental framework of van Miltenburg et al. (2020) to both unimodal (text-only) and multimodal (text + image) settings, evaluating four error types: age, gender, clothing type, and clothing colour. Our findings reveal that humans assign varying levels of severity to different error types, with visual context significantly amplifying perceived severity for colour and type errors. Notably, most LLMs assign low scores to gender errors but disproportionately high scores to colour errors, unlike humans, who judge both as highly severe but for different reasons. This suggests that these models may have internalised social norms influencing gender judgments but lack the perceptual grounding to emulate human sensitivity to colour, which is shaped by distinct neural mechanisms. Only one of the evaluated LLMs, Doubao, replicates the human-like ranking of error severity, but it fails to distinguish between error types as clearly as humans. Surprisingly, DeepSeek-V3, a unimodal LLM, achieves the highest alignment with human judgments across both unimodal and multimodal conditions, outperforming even state-of-the-art multimodal models.
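
The abstract does not say how "alignment with human judgments" is measured. As one plausible illustration only, the sketch below computes a Spearman rank correlation between two lists of severity scores. The choice of Spearman correlation, the function names, and the score lists are all assumptions made for illustration; the numbers are arbitrary placeholders, not data or results from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are equal")
    return cov / (var_x * var_y) ** 0.5


# Arbitrary placeholder scores for four hypothetical items (NOT from the paper).
human_scores = [1.0, 2.0, 3.0, 4.0]
model_same_order = [1.5, 2.5, 3.5, 4.5]      # same ordering as the human scores
model_reversed_order = [4.0, 3.0, 2.0, 1.0]  # opposite ordering

print(spearman(human_scores, model_same_order))
print(spearman(human_scores, model_reversed_order))
```

A rank-based measure is a natural fit here because the abstract talks about models replicating the human ranking of error severity, and rank correlation ignores differences in the absolute scale of the scores.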

arXiv information

Authors	Diege Sun, Guanyi Chen, Fan Zhao, Xiaorong Cheng, Tingting He
Published	2025-06-05 15:24:33+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
