Temperature-scaling surprisal estimates improve fit to human reading times — but does it do so for the ‘right reasons’?

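Summary

A large body of evidence shows that human language processing difficulty is predicted by surprisal, an information-theoretic measure defined as a word's negative log probability in context. However, it remains unclear how best to estimate the probabilities needed to predict human processing difficulty. A long-standing belief held that models with lower perplexity estimate word predictability more accurately and therefore predict reading times better, but recent work has shown that psycholinguistic predictive power decreases for very large models. One possible reason is that language models may be more confident in their predictions than humans are. In this paper, we test how temperature-scaling of large language model (LLM) predictions affects surprisal estimates and their power to predict reading times of English texts. First, we show that the calibration of large language models generally improves with model size. Second, we find that temperature-scaling the probabilities systematically improves fit to reading times across several reading time corpora (up to 89% improvement in delta log-likelihood). Finally, we show that this improvement in fit is driven chiefly by words composed of multiple subword tokens.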

Summary (original)

A wide body of evidence shows that human language processing difficulty is predicted by the information-theoretic measure surprisal, a word’s negative log probability in context. However, it is still unclear how to best estimate these probabilities needed for predicting human processing difficulty — while a long-standing belief held that models with lower perplexity would provide more accurate estimates of word predictability, and therefore lead to better reading time predictions, recent work has shown that for very large models, psycholinguistic predictive power decreases. One reason could be that language models might be more confident of their predictions than humans, because they have had exposure to several magnitudes more data. In this paper, we test what effect temperature-scaling of large language model (LLM) predictions has on surprisal estimates and their predictive power of reading times of English texts. Firstly, we show that calibration of large language models typically improves with model size, i.e. poorer calibration cannot account for poorer fit to reading times. Secondly, we find that temperature-scaling probabilities lead to a systematically better fit to reading times (up to 89% improvement in delta log likelihood), across several reading time corpora. Finally, we show that this improvement in fit is chiefly driven by words that are composed of multiple subword tokens.
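To make the quantities in the abstract concrete, the following is a minimal sketch of temperature-scaled surprisal, not the authors' implementation. It assumes surprisal is measured in bits, that temperature scaling divides the model's logits by a temperature T before the softmax, and that the surprisal of a multi-token word is the sum of its subword surprisals. The function names and the toy logits are hypothetical and chosen only for illustration.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Natural-log probabilities from logits divided by a temperature T > 0."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]

def token_surprisal(logits, target_index, temperature=1.0):
    """Surprisal in bits: negative log2 probability of the target token."""
    return -log_softmax(logits, temperature)[target_index] / math.log(2)

def word_surprisal(steps, temperature=1.0):
    """Sum subword surprisals; `steps` is a list of (logits, target_index)."""
    return sum(token_surprisal(lg, i, temperature) for lg, i in steps)

# Toy logits over a 4-token vocabulary (made-up numbers, not model output).
logits = [4.0, 1.0, 0.5, 0.0]

# T > 1 flattens the distribution: the confident top prediction becomes
# more surprising, while an unlikely token becomes less surprising.
s_top_t1 = token_surprisal(logits, 0, temperature=1.0)
s_top_t2 = token_surprisal(logits, 0, temperature=2.0)
s_low_t1 = token_surprisal(logits, 3, temperature=1.0)
s_low_t2 = token_surprisal(logits, 3, temperature=2.0)

# A hypothetical word split into two subword tokens.
two_token_word = [(logits, 0), ([0.2, 2.5, 0.1, 0.3], 1)]
s_word = word_surprisal(two_token_word, temperature=2.0)
```

With T = 1 the sketch reduces to ordinary surprisal. A temperature above 1 lowers the model's confidence, which is the direction one would need if language models are more confident than human readers.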

arXiv information

Authors	Tong Liu,Iza Škrjanec,Vera Demberg
Published	2024-07-03 16:12:32+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
