jarxiv | Japanese arxiv

Linear $Q$-Learning Does Not Diverge in $L^2$: Convergence Rates to a Bounded Set

Posted: May 28, 2025 | Author: jarxiv

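Summary

$Q$-learning is one of the most fundamental reinforcement learning algorithms.
$Q$-learning with linear function approximation (i.e., linear $Q$-learning) was widely believed to be prone to divergence until the recent work of Meyn (2024) established the ultimate almost sure boundedness of the linear $Q$-learning iterates.
Building on this success, this paper further establishes the first $L^2$ convergence rate of linear $Q$-learning iterates (to a bounded set).
As in Meyn (2024), the original linear $Q$-learning algorithm is left unmodified, and neither a Bellman completeness assumption nor a near-optimality assumption on the behavior policy is made.
All that is needed is an $\epsilon$-softmax behavior policy with an adaptive temperature.
The key to the analysis is a general result on stochastic approximation under Markovian noise with fast-changing transition functions.
As a side product, this general result is also used to establish the $L^2$ convergence rate of tabular $Q$-learning with an $\epsilon$-softmax behavior policy.

Summary (original)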

$Q$-learning is one of the most fundamental reinforcement learning algorithms. It is widely believed that $Q$-learning with linear function approximation (i.e., linear $Q$-learning) suffers from possible divergence until the recent work Meyn (2024) which establishes the ultimate almost sure boundedness of the iterates of linear $Q$-learning. Building on this success, this paper further establishes the first $L^2$ convergence rate of linear $Q$-learning iterates (to a bounded set). Similar to Meyn (2024), we do not make any modification to the original linear $Q$-learning algorithm, do not make any Bellman completeness assumption, and do not make any near-optimality assumption on the behavior policy. All we need is an $\epsilon$-softmax behavior policy with an adaptive temperature. The key to our analysis is the general result of stochastic approximations under Markovian noise with fast-changing transition functions. As a side product, we also use this general result to establish the $L^2$ convergence rate of tabular $Q$-learning with an $\epsilon$-softmax behavior policy, for which we rely on a novel pseudo-contraction property of the weighted Bellman optimality operator.
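
The two ingredients named in the abstract, an $\epsilon$-softmax behavior policy and the unmodified linear $Q$-learning update, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the abstract does not say how the temperature adapts, so `temperature` is simply a caller-supplied argument here, and the function names and the feature map `phi` are hypothetical.

```python
import math

def epsilon_softmax(q_values, epsilon, temperature):
    """Mix a uniform policy (weight epsilon) with a softmax over q_values.

    Assumption: the temperature is passed in by the caller; the paper's
    adaptive rule is not given in the abstract and is not reproduced here.
    """
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    z = sum(exps)
    n = len(q_values)
    return [epsilon / n + (1.0 - epsilon) * e / z for e in exps]

def linear_q_step(w, phi, s, a, r, s_next, actions, alpha, gamma):
    """One standard semi-gradient linear Q-learning update.

    phi(state, action) is a hypothetical feature map returning a list of floats.
    """
    def q(state, action):
        return sum(wi * xi for wi, xi in zip(w, phi(state, action)))

    target = r + gamma * max(q(s_next, b) for b in actions)
    td_error = target - q(s, a)
    return [wi + alpha * td_error * xi for wi, xi in zip(w, phi(s, a))]
```

Every action receives probability at least `epsilon / n`, so the behavior policy keeps exploring regardless of the current weights.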

arXiv information

Authors	Xinyu Liu, Zixuan Xie, Shangtong Zhang
Published	2025-05-27 16:10:41+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.LG, stat.ML

DeSocial: Blockchain-based Decentralized Social Networks

Posted: May 28, 2025 | Author: jarxiv

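Summary

Web 2.0 social platforms are inherently centralized, with user data and algorithmic decisions controlled by the platform.
As a result, users can only passively receive social predictions without being able to choose the underlying algorithm, which limits personalization.
Fortunately, with the emergence of blockchain, users can choose algorithms tailored to their local situation, improving prediction results in a personalized way.
In a blockchain environment, each user owns a model that performs social prediction, capturing a different perspective on social interactions.
This work proposes DeSocial, a decentralized social network learning framework deployed on an Ethereum (ETH) local development chain that integrates distributed data storage, node-level consensus, and user-driven model selection.
In the first stage, each user uses DeSocial to evaluate multiple backbone models on their local subgraph.
DeSocial coordinates the execution and returns per-model prediction results, letting the user select the most suitable backbone for personalized social prediction.
DeSocial then uniformly selects several validation nodes that hold the algorithm specified by each user and aggregates their predictions by majority voting, to prevent errors caused by any single model's misjudgment.
Extensive experiments show that DeSocial clearly improves on five classical centralized social network learning models, promoting user empowerment in blockchain-based decentralized social networks and showing the importance of blockchain-based multi-node validation and personalized algorithm selection.
The implementation is available at https://github.com/agiresearch/DeSocial.

Summary (original)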

Web 2.0 social platforms are inherently centralized, with user data and algorithmic decisions controlled by the platform. However, users can only passively receive social predictions without being able to choose the underlying algorithm, which limits personalization. Fortunately, with the emergence of blockchain, users are allowed to choose algorithms that are tailored to their local situation, improving prediction results in a personalized way. In a blockchain environment, each user possesses its own model to perform the social prediction, capturing different perspectives on social interactions. In our work, we propose DeSocial, a decentralized social network learning framework deployed on an Ethereum (ETH) local development chain that integrates distributed data storage, node-level consensus, and user-driven model selection through Ganache. In the first stage, each user leverages DeSocial to evaluate multiple backbone models on their local subgraph. DeSocial coordinates the execution and returns model-wise prediction results, enabling the user to select the most suitable backbone for personalized social prediction. Then, DeSocial uniformly selects several validation nodes that possess the algorithm specified by each user, and aggregates the prediction results by majority voting, to prevent errors caused by any single model’s misjudgment. Extensive experiments show that DeSocial has an evident improvement compared to the five classical centralized social network learning models, promoting user empowerment in blockchain-based decentralized social networks, showing the importance of multi-node validation and personalized algorithm selection based on blockchain. Our implementation is available at: https://github.com/agiresearch/DeSocial.
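
The validation step described above (uniform selection of validator nodes followed by majority voting) might look roughly like the sketch below. It only illustrates the aggregation idea and is not taken from the DeSocial repository; the function names are hypothetical, and the on-chain coordination is left out entirely.

```python
import random
from collections import Counter

def select_validators(candidates, k, rng):
    """Uniformly sample k distinct validator nodes without replacement."""
    return rng.sample(candidates, k)

def majority_vote(predictions):
    """Return the most common prediction.

    Assumption: ties are broken in favor of the prediction seen first;
    the abstract does not specify a tie-breaking rule.
    """
    return Counter(predictions).most_common(1)[0][0]

# Example: three validators vote on whether a link exists (1) or not (0).
rng = random.Random(0)
validators = select_validators(list(range(10)), 3, rng)
votes = [1, 0, 1]  # hypothetical per-validator predictions
decision = majority_vote(votes)
```

Using an odd number of validators avoids ties for binary predictions, which is one reason a sketch like this would typically pick `k` odd.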

arXiv information

Authors	Jingyuan Huang, Xi Zhu, Minghao Guo, Yongfeng Zhang
Published	2025-05-27 16:17:06+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.LG, cs.SI

Finite Sample Analysis of Linear Temporal Difference Learning with Arbitrary Features

Posted: May 28, 2025 | Author: jarxiv

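Summary

Linear TD($\lambda$) is one of the most fundamental reinforcement learning algorithms for policy evaluation.
Previously, convergence rates have typically been established under the assumption of linearly independent features, which does not hold in many practical scenarios.
This paper instead establishes the first $L^2$ convergence rates for linear TD($\lambda$) operating under arbitrary features, without any algorithmic modification or additional assumptions.
The results apply to both the discounted and average-reward settings.
To address the potential non-uniqueness of solutions caused by arbitrary features, the paper develops a novel stochastic approximation result that gives convergence rates to the solution set rather than to a single point.

Summary (original)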

Linear TD($\lambda$) is one of the most fundamental reinforcement learning algorithms for policy evaluation. Previously, convergence rates are typically established under the assumption of linearly independent features, which does not hold in many practical scenarios. This paper instead establishes the first $L^2$ convergence rates for linear TD($\lambda$) operating under arbitrary features, without making any algorithmic modification or additional assumptions. Our results apply to both the discounted and average-reward settings. To address the potential non-uniqueness of solutions resulting from arbitrary features, we develop a novel stochastic approximation result featuring convergence rates to the solution set instead of a single point.
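
For reference, one step of the standard linear TD($\lambda$) update in the discounted setting can be sketched as below. This is the textbook form with accumulating eligibility traces, not code from the paper, and the function name is hypothetical. The update is well defined even when features are linearly dependent, which is the case the abstract targets.

```python
def td_lambda_step(w, z, x, x_next, r, alpha, gamma, lam):
    """One linear TD(lambda) update with accumulating eligibility traces.

    w: weight vector, z: eligibility trace, x / x_next: feature vectors of the
    current and next state, r: reward, alpha: step size, gamma: discount
    factor, lam: trace decay parameter lambda.
    """
    v = sum(wi * xi for wi, xi in zip(w, x))
    v_next = sum(wi * xi for wi, xi in zip(w, x_next))
    delta = r + gamma * v_next - v  # temporal-difference error
    z = [gamma * lam * zi + xi for zi, xi in zip(z, x)]
    w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]
    return w, z

# Duplicated (linearly dependent) features: both coordinates move together.
w, z = td_lambda_step([0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0],
                      r=1.0, alpha=0.1, gamma=0.9, lam=0.5)
```

With duplicated features, many weight vectors represent the same value function, which illustrates why convergence is stated with respect to a solution set rather than a single point.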

arXiv information

Authors	Zixuan Xie, Xinyu Liu, Rohan Chandra, Shangtong Zhang
Published	2025-05-27 16:17:49+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.LG, stat.ML

Leveraging the Power of Conversations: Optimal Key Term Selection in Conversational Contextual Bandits

Posted: May 28, 2025 | Author: jarxiv

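Summary

Conversational recommender systems proactively query users with relevant "key terms" and leverage the feedback to elicit user preferences for personalized recommendations.
Conversational contextual bandits, a prevalent approach in this domain, aim to optimize preference learning by balancing exploitation and exploration.
However, several limitations hinder their effectiveness in real-world scenarios.
First, existing algorithms use key term selection strategies with insufficient exploration, often failing to probe user preferences thoroughly and producing suboptimal preference estimates.
Second, current algorithms typically rely on deterministic rules to initiate conversations, causing unnecessary interactions when preferences are well understood and missed opportunities when they are uncertain.
To address these limitations, the paper proposes three new algorithms: CLiSK, CLiME, and CLiSK-ME.
CLiSK introduces smoothed key term contexts to enhance exploration in preference learning, CLiME adaptively initiates conversations based on preference uncertainty, and CLiSK-ME integrates both techniques.
All three algorithms are proven to achieve a tighter regret upper bound of $O(\sqrt{dT\log{T}})$ with respect to the time horizon $T$, improving on existing methods.
In addition, a matching lower bound of $\Omega(\sqrt{dT})$ for conversational bandits is provided, showing that the algorithms are nearly minimax optimal.
Extensive evaluations on both synthetic and real-world datasets show at least a 14.6% improvement in cumulative regret.

Summary (original)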

Conversational recommender systems proactively query users with relevant ‘key terms’ and leverage the feedback to elicit users’ preferences for personalized recommendations. Conversational contextual bandits, a prevalent approach in this domain, aim to optimize preference learning by balancing exploitation and exploration. However, several limitations hinder their effectiveness in real-world scenarios. First, existing algorithms employ key term selection strategies with insufficient exploration, often failing to thoroughly probe users’ preferences and resulting in suboptimal preference estimation. Second, current algorithms typically rely on deterministic rules to initiate conversations, causing unnecessary interactions when preferences are well-understood and missed opportunities when preferences are uncertain. To address these limitations, we propose three novel algorithms: CLiSK, CLiME, and CLiSK-ME. CLiSK introduces smoothed key term contexts to enhance exploration in preference learning, CLiME adaptively initiates conversations based on preference uncertainty, and CLiSK-ME integrates both techniques. We theoretically prove that all three algorithms achieve a tighter regret upper bound of $O(\sqrt{dT\log{T}})$ with respect to the time horizon $T$, improving upon existing methods. Additionally, we provide a matching lower bound $\Omega(\sqrt{dT})$ for conversational bandits, demonstrating that our algorithms are nearly minimax optimal. Extensive evaluations on both synthetic and real-world datasets show that our approaches achieve at least a 14.6% improvement in cumulative regret.
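
To see why the two bounds are described as "nearly" matching, it may help to compare them directly. In the sketch below, $R(T)$ denotes cumulative regret and $C_1, C_2 > 0$ are hypothetical constants standing in for whatever the $O(\cdot)$ and $\Omega(\cdot)$ notation hides; the upper bound is for the proposed algorithms, the lower bound for any algorithm on a worst-case instance.

```latex
R(T) \le C_1 \sqrt{dT \log T}
\qquad \text{and} \qquad
R(T) \ge C_2 \sqrt{dT}
\quad\Longrightarrow\quad
\frac{C_1 \sqrt{dT \log T}}{C_2 \sqrt{dT}} = \frac{C_1}{C_2}\,\sqrt{\log T}.
```

The gap is therefore only a $\sqrt{\log T}$ factor and does not depend on the dimension $d$. With the natural logarithm and $T = 10^6$, for instance, $\sqrt{\log T} \approx 3.7$.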

arXiv information

Authors	Maoli Liu, Zhuohua Li, Xiangxiang Dai, John C. S. Lui
Published	2025-05-27 16:22:32+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.LG

Foundation Models on a Budget: Approximating Blocks in Large Vision Models

Posted: May 28, 2025 | Author: jarxiv

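Summary

Foundation models have shown impressive performance across tasks and domains, but they require massive computational resources, raising concerns about accessibility and sustainability.
Previous attempts to reduce foundation model size fall short of fully addressing the problem, because they end up increasing the computational load through additional training steps.
Recent work shows that deep neural networks exhibit similarities among their internal representations.
While inter-network similarities have enabled techniques such as model stitching and merging, intra-network similarities remain underexplored as a means of improving efficiency.
This paper proposes Transformer Blocks Approximation (TBA), a novel method that leverages intra-network similarities to identify and approximate transformer blocks in large vision models.
TBA replaces these blocks with lightweight, closed-form transformations, without retraining or fine-tuning the rest of the model.
The proposed method reduces the number of parameters while having minimal impact on the downstream task.
The effectiveness and generalizability of TBA are validated through extensive experiments across multiple datasets (e.g., Imagenet-1k and CIFAR100) and state-of-the-art pretrained vision models (e.g., ViT, DiNO-v2, and DEiT).

Summary (original)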

Foundation Models have shown impressive performance in various tasks and domains, yet they require massive computational resources, raising concerns about accessibility and sustainability. Previous attempts to reduce foundation model size fall short of fully addressing the problem, as they end up increasing computational load through additional training steps. Recent works reveal that deep neural networks exhibit internal representation similarities. While inter-network similarities have enabled techniques such as model stitching and merging, intra-network similarities remain underexplored for improving efficiency. In this paper, we propose Transformer Blocks Approximation (TBA), a novel method that leverages intra-network similarities to identify and approximate transformer blocks in large vision models. TBA replaces these blocks using lightweight, closed-form transformations, without retraining or fine-tuning the rest of the model. The proposed method reduces the number of parameters while having minimal impact on the downstream task. We validate the effectiveness and generalizability of TBA through extensive experiments across multiple datasets (e.g., Imagenet-1k and CIFAR100) and state-of-the-art pretrained vision models (e.g, ViT, DiNO-v2, and DEiT).
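
The abstract does not specify which closed-form transformation TBA uses. Purely to illustrate what "closed-form, no retraining" can mean, the sketch below fits a per-dimension scale and shift by ordinary least squares from paired block inputs and outputs. The per-dimension affine form, the function names, and the toy activations are all assumptions made for this example, not the paper's method.

```python
def fit_affine(xs, ys):
    """Closed-form 1-D least squares: (a, b) minimizing sum((a*x + b - y)**2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x if var_x > 0 else 0.0  # constant input: fall back to shift only
    return a, mean_y - a * mean_x

def fit_block_replacement(inputs, outputs):
    """Fit one (scale, shift) pair per feature dimension from paired activations."""
    dims = len(inputs[0])
    return [
        fit_affine([row[d] for row in inputs], [row[d] for row in outputs])
        for d in range(dims)
    ]

def apply_replacement(params, x):
    """Apply the fitted transformation in place of the original block."""
    return [a * xi + b for (a, b), xi in zip(params, x)]

# Hypothetical activations recorded before and after one block.
block_inputs = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
block_outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
params = fit_block_replacement(block_inputs, block_outputs)
approx = apply_replacement(params, [3.0, 4.0])
```

Because the fit has a closed-form solution, it needs only a single pass over recorded activations and no gradient steps, which matches the "without retraining or fine-tuning" property the abstract emphasizes.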

arXiv information

Authors	Irene Cannistraci, Simone Antonelli, Emanuele Palumbo, Thomas M. Sutter, Emanuele Rodolà, Bastian Rieck, Julia E. Vogt
Published	2025-05-27 16:22:32+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.LG

Improving Research Idea Generation Through Data: An Empirical Investigation in Social Science

Posted: May 28, 2025 | Author: jarxiv

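Summary

Recent advances in large language models (LLMs) have shown promise in generating novel research ideas.
However, these ideas often face challenges related to feasibility and expected effectiveness.
This paper explores how augmenting LLMs with relevant data during the idea generation process can enhance the quality of the generated ideas.
It introduces two ways of incorporating data: (1) providing metadata during the idea generation stage to guide LLMs toward feasible directions, and (2) adding automatic validation during the idea selection stage to assess the empirical plausibility of hypotheses within ideas.
Experiments in the social science domain, specifically on climate negotiation topics, find that metadata improves the feasibility of generated ideas by 20%, while automatic validation improves the overall quality of selected ideas by 7%.
A human study shows that LLM-generated ideas, together with their related data and validation processes, inspire researchers to propose higher-quality research ideas.
The work highlights the potential of data-driven research idea generation and underscores the practical utility of LLM-assisted ideation in real-world academic settings.

Summary (original)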

Recent advancements in large language models (LLMs) have shown promise in generating novel research ideas. However, these ideas often face challenges related to feasibility and expected effectiveness. This paper explores how augmenting LLMs with relevant data during the idea generation process can enhance the quality of generated ideas. We introduce two ways of incorporating data: (1) providing metadata during the idea generation stage to guide LLMs toward feasible directions, and (2) adding automatic validation during the idea selection stage to assess the empirical plausibility of hypotheses within ideas. We conduct experiments in the social science domain, specifically with climate negotiation topics, and find that metadata improves the feasibility of generated ideas by 20%, while automatic validation improves the overall quality of selected ideas by 7%. A human study shows that LLM-generated ideas, along with their related data and validation processes, inspire researchers to propose research ideas with higher quality. Our work highlights the potential of data-driven research idea generation, and underscores the practical utility of LLM-assisted ideation in real-world academic settings.

arXiv information

Authors	Xiao Liu, Xinyi Dong, Xinyang Gao, Yansong Feng, Xun Pang
Published	2025-05-27 16:23:42+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.CL, cs.CY, cs.HC

Leveraging Large Language Models for Active Merchant Non-player Characters

Posted: May 28, 2025 | Author: jarxiv

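Summary

The paper highlights two significant issues behind the passivity of current merchant non-player characters (NPCs): pricing and communication.
While immersive interactions with active NPCs have been a focus, price negotiations between merchant NPCs and players remain underexplored.
First, passive pricing refers to merchants' limited ability to modify predefined item prices.
Second, passive communication means that merchants can interact with players only in a scripted manner.
To tackle these issues and create an active merchant NPC, the paper proposes MART, a merchant framework based on large language models (LLMs) that consists of an appraiser module and a negotiator module.
Two experiments were conducted to explore implementation options under different training methods and LLM sizes, considering a range of possible game environments.
The findings indicate that finetuning methods such as supervised finetuning (SFT) and knowledge distillation (KD) are effective for implementing active merchant NPCs with smaller LLMs.
In addition, three irregular cases arising from LLM responses were found.

Summary (original)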

We highlight two significant issues leading to the passivity of current merchant non-player characters (NPCs): pricing and communication. While immersive interactions with active NPCs have been a focus, price negotiations between merchant NPCs and players remain underexplored. First, passive pricing refers to the limited ability of merchants to modify predefined item prices. Second, passive communication means that merchants can only interact with players in a scripted manner. To tackle these issues and create an active merchant NPC, we propose a merchant framework based on large language models (LLMs), called MART, which consists of an appraiser module and a negotiator module. We conducted two experiments to explore various implementation options under different training methods and LLM sizes, considering a range of possible game environments. Our findings indicate that finetuning methods, such as supervised finetuning (SFT) and knowledge distillation (KD), are effective in using smaller LLMs to implement active merchant NPCs. Additionally, we found three irregular cases arising from the responses of LLMs.

arXiv information

Authors	Byungjun Kim, Minju Kim, Dayeon Seo, Bugeun Kim
Published	2025-05-27 16:23:46+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.CL

A Structured Unplugged Approach for Foundational AI Literacy in Primary Education

Posted: May 28, 2025 | Author: jarxiv

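Summary

Younger generations are growing up in a world increasingly shaped by intelligent technologies, making early AI literacy crucial for developing the skills to critically understand and navigate them.
However, education in this field often emphasizes tool-based learning, prioritizing usage over understanding of the underlying concepts.
This lack of knowledge leaves non-experts, especially children, prone to misconceptions, unrealistic expectations, and difficulty recognizing biases and stereotypes.
This paper proposes a structured and replicable teaching approach that fosters foundational AI literacy in primary students by building on core mathematical elements that are closely connected to, and of interest in, primary curricula, in order to strengthen conceptualization, data representation, classification reasoning, and evaluation of AI.
To assess the approach, an empirical study was conducted with thirty-one fifth-grade students across two classes, evaluating their progress through a post-test and a satisfaction survey.
The results indicate improvements in terminology understanding and usage, feature description, logical reasoning, and evaluative skills, with students showing a deeper understanding of decision-making processes and their limitations.
Moreover, the approach proved engaging, with students particularly enjoying activities that linked AI concepts to real-world reasoning.
Materials: https://github.com/tail-unica/ai-literacy-primary-ed.

Summary (original)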

Younger generations are growing up in a world increasingly shaped by intelligent technologies, making early AI literacy crucial for developing the skills to critically understand and navigate them. However, education in this field often emphasizes tool-based learning, prioritizing usage over understanding the underlying concepts. This lack of knowledge leaves non-experts, especially children, prone to misconceptions, unrealistic expectations, and difficulties in recognizing biases and stereotypes. In this paper, we propose a structured and replicable teaching approach that fosters foundational AI literacy in primary students, by building upon core mathematical elements closely connected to and of interest in primary curricula, to strengthen conceptualization, data representation, classification reasoning, and evaluation of AI. To assess the effectiveness of our approach, we conducted an empirical study with thirty-one fifth-grade students across two classes, evaluating their progress through a post-test and a satisfaction survey. Our results indicate improvements in terminology understanding and usage, features description, logical reasoning, and evaluative skills, with students showing a deeper comprehension of decision-making processes and their limitations. Moreover, the approach proved engaging, with students particularly enjoying activities that linked AI concepts to real-world reasoning. Materials: https://github.com/tail-unica/ai-literacy-primary-ed.

arXiv information

Authors	Maria Cristina Carrisi, Mirko Marras, Sara Vergallo
Published	2025-05-27 16:23:57+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.ET

Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling

Posted: May 28, 2025 | Author: jarxiv

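Summary

Factual incorrectness in generated content is one of the primary concerns in the ubiquitous deployment of large language models (LLMs).
Prior findings suggest that LLMs can sometimes detect factual incorrectness in their generated content (i.e., fact-checking after generation).
This work provides evidence for an internal compass in LLMs that dictates the correctness of factual recall at the time of generation.
It demonstrates that, for a given subject entity and relation, LLMs internally encode linear features in the Transformer's residual stream that dictate whether the model will be able to recall the correct attribute (the one that forms a valid entity-relation-attribute triplet).
This self-awareness signal is robust to minor formatting variations.
The effects of context perturbation are investigated through different example selection strategies.
Scaling experiments across model sizes and training dynamics show that self-awareness emerges rapidly during training and peaks in intermediate layers.
These findings uncover intrinsic self-monitoring capabilities within LLMs, contributing to their interpretability and reliability.

Summary (original)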

Factual incorrectness in generated content is one of the primary concerns in ubiquitous deployment of large language models (LLMs). Prior findings suggest LLMs can (sometimes) detect factual incorrectness in their generated content (i.e., fact-checking post-generation). In this work, we provide evidence supporting the presence of LLMs’ internal compass that dictate the correctness of factual recall at the time of generation. We demonstrate that for a given subject entity and a relation, LLMs internally encode linear features in the Transformer’s residual stream that dictate whether it will be able to recall the correct attribute (that forms a valid entity-relation-attribute triplet). This self-awareness signal is robust to minor formatting variations. We investigate the effects of context perturbation via different example selection strategies. Scaling experiments across model sizes and training dynamics highlight that self-awareness emerges rapidly during training and peaks in intermediate layers. These findings uncover intrinsic self-monitoring capabilities within LLMs, contributing to their interpretability and reliability.

arXiv information

Authors	Hovhannes Tamoyan, Subhabrata Dutta, Iryna Gurevych
Published	2025-05-27 16:24:02+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.CL, cs.LG

RelationalFactQA: A Benchmark for Evaluating Tabular Fact Retrieval from Large Language Models

Posted: May 28, 2025 | Author: jarxiv

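Summary

Factuality in large language models (LLMs) is a persistent challenge.
Current benchmarks often assess short factual answers, overlooking the critical ability to generate structured, multi-record tabular outputs from parametric knowledge.
The paper demonstrates that this relational fact retrieval is substantially more difficult than isolated point-wise queries, even when the individual facts are known to the model, and that it exposes distinct failure modes sensitive to output dimensionality (e.g., the number of attributes or records).
To systematically evaluate this under-explored capability, it introduces RelationalFactQA, a new benchmark featuring diverse natural language questions (paired with SQL) and gold-standard tabular answers.
RelationalFactQA enables analysis across varying query complexities, output sizes, and data characteristics.
Experiments reveal that even state-of-the-art LLMs struggle significantly, not exceeding 25% factual accuracy when generating relational outputs, with performance degrading notably as output dimensionality increases.
These findings underscore critical limitations in current LLMs' ability to synthesize structured factual knowledge and establish RelationalFactQA as a crucial resource for measuring future progress in LLM factuality.

Summary (original)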

Factuality in Large Language Models (LLMs) is a persistent challenge. Current benchmarks often assess short factual answers, overlooking the critical ability to generate structured, multi-record tabular outputs from parametric knowledge. We demonstrate that this relational fact retrieval is substantially more difficult than isolated point-wise queries, even when individual facts are known to the model, exposing distinct failure modes sensitive to output dimensionality (e.g., number of attributes or records). To systematically evaluate this under-explored capability, we introduce RelationalFactQA, a new benchmark featuring diverse natural language questions (paired with SQL) and gold-standard tabular answers, specifically designed to assess knowledge retrieval in a structured format. RelationalFactQA enables analysis across varying query complexities, output sizes, and data characteristics. Our experiments reveal that even state-of-the-art LLMs struggle significantly, not exceeding 25% factual accuracy in generating relational outputs, with performance notably degrading as output dimensionality increases. These findings underscore critical limitations in current LLMs’ ability to synthesize structured factual knowledge and establish RelationalFactQA as a crucial resource for measuring future progress in LLM factuality.

arXiv information

Authors	Dario Satriani, Enzo Veltri, Donatello Santoro, Paolo Papotti
Published	2025-05-27 16:33:38+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.CL, cs.DB
