Human Behavioral Benchmarking: Numeric Magnitude Comparison Effects in Large Language Models

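Summary

Large language models (LLMs) do not differentially represent numbers, even though numbers are pervasive in text.
In contrast, neuroscience research has identified distinct neural representations for numbers and words.
This study examines, through a behavioral lens, how well popular LLMs capture numeric magnitude (e.g., $4 < 5$). Earlier work on the representational capabilities of LLMs has assessed whether they reach human-level performance, such as high overall accuracy on standard benchmarks. Here the authors pose a different question, inspired by cognitive science: how closely do the number representations of LLMs match those of human language users, who typically show distance, size, and ratio effects? Using a linking hypothesis, they map the similarities among model embeddings of number words and digits onto human response times. The results show surprisingly human-like representations across language models of different architectures, even though the models lack the neural circuitry that directly supports these representations in the human brain. The study demonstrates the usefulness of behavioral benchmarks for understanding LLMs and points toward future work on the number representations of LLMs and their cognitive plausibility.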

Abstract (original)

Large Language Models (LLMs) do not differentially represent numbers, which are pervasive in text. In contrast, neuroscience research has identified distinct neural representations for numbers and words. In this work, we investigate how well popular LLMs capture the magnitudes of numbers (e.g., that $4 < 5$) from a behavioral lens. Prior research on the representational capabilities of LLMs evaluates whether they show human-level performance, for instance, high overall accuracy on standard benchmarks. Here, we ask a different question, one inspired by cognitive science: How closely do the number representations of LLMs correspond to those of human language users, who typically demonstrate the distance, size, and ratio effects? We depend on a linking hypothesis to map the similarities among the model embeddings of number words and digits to human response times. The results reveal surprisingly human-like representations across language models of different architectures, despite the absence of the neural circuitry that directly supports these representations in the human brain. This research shows the utility of understanding LLMs using behavioral benchmarks and points the way to future work on the number representations of LLMs and their cognitive plausibility.
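The linking hypothesis mentioned in the abstract can be illustrated with a small sketch. The abstract does not spell out the procedure, so everything here is an assumption made for illustration: cosine similarity is used as the similarity measure, `toy_embedding` is a hypothetical stand-in for real model embeddings (it places each number on an arc according to its logarithm), and the link is that more similar embeddings correspond to slower human comparisons. In humans, the distance effect means that comparing 4 and 5 takes longer than comparing 1 and 9, so a human-like model would show similarity falling as numeric distance grows.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_embedding(n):
    """Hypothetical 2-D embedding (NOT from any real model).

    Numbers are placed on the unit circle at angle log(n), so pairs
    with a small ratio end up close together.
    """
    angle = math.log(n)
    return (math.cos(angle), math.sin(angle))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# All ordered pairs of distinct single-digit numbers with a < b.
pairs = [(a, b) for a in range(1, 10) for b in range(a + 1, 10)]

# Numeric distance |a - b| for each pair.
distances = [b - a for a, b in pairs]

# Embedding similarity for each pair; under the assumed linking
# hypothesis, higher similarity stands in for a longer response time.
similarities = [
    cosine_similarity(toy_embedding(a), toy_embedding(b)) for a, b in pairs
]

# A distance effect appears as a negative correlation: the farther
# apart two numbers are, the less similar their embeddings.
r = pearson(distances, similarities)
print("correlation between distance and similarity:", round(r, 3))
```

To probe a real model, `toy_embedding` would be replaced by a lookup of that model's embeddings for digits or number words; the rest of the sketch would stay the same.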

arXiv information

Authors	Raj Sanjay Shah, Vijay Marupudi, Reba Koenen, Khushi Bhardwaj, Sashank Varma
Published	2023-11-08 12:39:51+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
