Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time

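Summary

Who is the US President?
The answer changes depending on when the question is asked.
Large language models (LLMs) are evaluated on a wide range of reasoning tasks, but these evaluations often overlook a crucial dimension: time.
In real-world scenarios, the correctness of an answer is frequently tied to its temporal context.
To address this gap, we present a new framework and dataset spanning more than 8,000 events from 2018 to 2024, annotated at day-level granularity and sourced globally across domains such as politics, science, and business.
Our TimeShift evaluation method systematically probes LLMs for temporal reasoning and reveals that base models often outperform their instruction-tuned and synthetic-trained counterparts on time-sensitive recall.
We also find that even large-scale models are brittle when handling paraphrased facts, which highlights unresolved challenges in temporal consistency.
By identifying these limitations, our work provides an important step toward time-aware language models that can adapt to the dynamic nature of real-world knowledge.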

Summary (Original)

Who is the US President? The answer changes depending on when the question is asked. While large language models (LLMs) are evaluated on various reasoning tasks, they often miss a crucial dimension: time. In real-world scenarios, the correctness of answers is frequently tied to temporal context. To address this gap, we present a novel framework and dataset spanning over 8,000 events from 2018 to 2024, annotated with day-level granularity and sourced globally across domains such as politics, science, and business. Our TimeShift evaluation method systematically probes LLMs for temporal reasoning, revealing that base models often outperform instruction-tuned and synthetic-trained counterparts on time-sensitive recall. Additionally, we find that even large-scale models exhibit brittleness in handling paraphrased facts, highlighting unresolved challenges in temporal consistency. By identifying these limitations, our work provides a significant step toward advancing time-aware language models capable of adapting to the dynamic nature of real-world knowledge.

arXiv Information

Authors	David Herel, Vojtech Bartek, Jiri Jirak, Tomas Mikolov
Published	2025-05-15 14:13:36+00:00
arXiv site	arxiv_id(pdf)

Source, Services Used

arxiv.jp, Google
