Unmasking the Limits of Large Language Models: A Systematic Evaluation of Masked Text Processing Ability through MskQA and MskCal

Summary

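This paper probes the limits of large language models (LLMs) by rigorously evaluating how well they process masked text.
It introduces two new tasks: MskQA, which measures reasoning on masked question-answering datasets such as RealtimeQA, and MskCal, which evaluates numerical reasoning on masked arithmetic problems. Tests with GPT-4o and 4o-mini show that LLMs have some resilience to masked text, but their performance depends strongly on the masking rate and on the semantic cues that remain.
Specifically, "solid masking", in which no semantic cues remain, causes a significant performance drop compared with "partial lifting", in which some semantic information is retained, indicating that LLMs rely on surface-level patterns.
GPT-4o consistently outperforms 4o-mini, particularly on MskCal, indicating a greater ability to handle numerical reasoning over masked text.
These results underscore the important role of semantic cues in the reasoning process of LLMs.
The study clarifies the interplay between background knowledge and reasoning ability in masked text processing, supports a deeper understanding of LLM capabilities and limitations, and highlights the need for more robust evaluation methods that accurately assess true comprehension.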
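To make the masking rate and the two masking styles concrete, here is a minimal Python sketch. It is not the authors' procedure: the abstract does not specify how words are selected or how "partial lifting" retains information, so the function name `mask_words`, the `[MASK]` placeholder, the word-level random selection, and the rule of keeping only the first character under partial lifting are all assumptions made for illustration.

```python
import random


def mask_words(text, rate, mode="solid", seed=0):
    """Mask a fraction `rate` of the whitespace-separated words in `text`.

    Hypothetical illustration only; the paper's actual masking scheme may differ.
    mode="solid":   a masked word becomes the constant placeholder "[MASK]",
                    so no trace of the word (not even its length) survives.
    mode="partial": a masked word keeps its first character and the rest
                    becomes "*", so a small cue survives (assumed rule).
                    One-character words are therefore left unchanged.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if mode not in ("solid", "partial"):
        raise ValueError("mode must be 'solid' or 'partial'")

    words = text.split()
    rng = random.Random(seed)  # fixed seed keeps the masking reproducible
    n_mask = round(len(words) * rate)
    chosen = set(rng.sample(range(len(words)), n_mask))

    masked = []
    for i, word in enumerate(words):
        if i not in chosen:
            masked.append(word)
        elif mode == "solid":
            masked.append("[MASK]")
        else:
            masked.append(word[0] + "*" * (len(word) - 1))
    return " ".join(masked)


if __name__ == "__main__":
    question = "What is 12 plus 30"
    print(mask_words(question, rate=0.5, mode="solid"))
    print(mask_words(question, rate=0.5, mode="partial"))
```

Under these assumptions, sweeping `rate` from 0 to 1 and comparing a model's answers on the `solid` and `partial` variants of the same question would mirror the kind of comparison the summary describes.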

Abstract (original)

This paper sheds light on the limitations of Large Language Models (LLMs) by rigorously evaluating their ability to process masked text. We introduce two novel tasks: MskQA, measuring reasoning on masked question-answering datasets like RealtimeQA, and MskCal, assessing numerical reasoning on masked arithmetic problems. Testing GPT-4o and 4o-mini reveals that while LLMs exhibit some resilience to masked text, their performance is highly contingent on masking rates and semantic cues. Specifically, ‘solid masking,’ where semantic clues are entirely absent, leads to a significant performance drop compared to ‘partial lifting,’ where some semantic information is retained, indicating LLMs’ reliance on surface-level patterns. Interestingly, GPT-4o consistently outperforms 4o-mini, particularly in MskCal, demonstrating a greater ability to handle numerical reasoning with masked text. This underscores the crucial role of semantic cues in the reasoning process of LLMs. Our study illuminates the interplay between background knowledge and reasoning ability in masked text processing, paving the way for a deeper understanding of LLM capabilities and limitations, and highlighting the need for more robust evaluation methods to accurately assess their true comprehension abilities.

arXiv information

Authors	Fuka Matsuzaki, Haru-Tada Sato
Published	2024-11-08 16:07:47+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
