jarxiv | Japanese arxiv | Page 1475

Small but Mighty: Enhancing Time Series Forecasting with Lightweight LLMs

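Posted: March 6, 2025 | Author: jarxiv

Summary

LLMs have shown remarkable potential in time series forecasting, but their practical deployment remains constrained by excessive computational demands and memory footprints.
Existing LLM-based approaches typically suffer from three critical limitations: inefficient parameter utilization when handling numerical time series patterns;
modality misalignment between continuous temporal signals and discrete text embeddings;
and inflexibility for real-time expert knowledge integration.
We present SMETimes, the first systematic investigation of sub-3B-parameter SLMs for efficient and accurate time series forecasting.
Our approach centers on three key innovations: a statistically enhanced prompting mechanism that bridges numerical time series and textual semantics through descriptive statistical features;
an adaptive fusion embedding architecture that aligns temporal patterns with the language model token space through learnable parameters;
and a dynamic mixture-of-experts framework, enabled by the computational efficiency of SLMs, that adaptively combines base predictions with domain-specific models.
Extensive evaluations across seven benchmark datasets show that the 3B-parameter SLM achieves state-of-the-art performance on five primary datasets while training 3.8x faster and consuming 5.2x less memory than 7B-parameter LLM baselines.
Notably, the proposed model exhibits better learning capabilities, achieving 12.3% lower MSE than conventional LLMs.
Ablation studies validate that the statistical prompting and cross-modal fusion modules contribute 15.7% and 18.2% error reduction, respectively, in long-horizon forecasting tasks.
By redefining the efficiency-accuracy trade-off landscape, this work establishes SLMs as viable alternatives to resource-intensive LLMs for practical time series forecasting.
Code and models are available at https://github.com/xiyan1234567/SMETimes.

Summary (original)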

While LLMs have demonstrated remarkable potential in time series forecasting, their practical deployment remains constrained by excessive computational demands and memory footprints. Existing LLM-based approaches typically suffer from three critical limitations: inefficient parameter utilization in handling numerical time series patterns; modality misalignment between continuous temporal signals and discrete text embeddings; and inflexibility for real-time expert knowledge integration. We present SMETimes, the first systematic investigation of sub-3B parameter SLMs for efficient and accurate time series forecasting. Our approach centers on three key innovations: a statistically-enhanced prompting mechanism that bridges numerical time series with textual semantics through descriptive statistical features; an adaptive fusion embedding architecture that aligns temporal patterns with language model token spaces through learnable parameters; and a dynamic mixture-of-experts framework enabled by SLMs’ computational efficiency, adaptively combining base predictions with domain-specific models. Extensive evaluations across seven benchmark datasets demonstrate that our 3B-parameter SLM achieves state-of-the-art performance on five primary datasets while maintaining 3.8x faster training and 5.2x lower memory consumption compared to 7B-parameter LLM baselines. Notably, the proposed model exhibits better learning capabilities, achieving 12.3% lower MSE than conventional LLMs. Ablation studies validate that our statistical prompting and cross-modal fusion modules respectively contribute 15.7% and 18.2% error reduction in long-horizon forecasting tasks. By redefining the efficiency-accuracy trade-off landscape, this work establishes SLMs as viable alternatives to resource-intensive LLMs for practical time series forecasting. Code and models are available at https://github.com/xiyan1234567/SMETimes.
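The idea of bridging a numeric window and text through descriptive statistics can be illustrated with a minimal sketch. This is not the SMETimes implementation: the function name `build_stat_prompt`, the choice of statistics, and the prompt wording are all assumptions made for illustration.

```python
import statistics


def build_stat_prompt(series, horizon):
    """Summarize a numeric window as text (hypothetical prompt format)."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)  # population standard deviation
    # Crude trend label from the first and last observation only.
    if series[-1] > series[0]:
        trend = "upward"
    elif series[-1] < series[0]:
        trend = "downward"
    else:
        trend = "flat"
    return (
        f"Input window: {len(series)} steps. "
        f"min={min(series):.2f}, max={max(series):.2f}, "
        f"mean={mean:.2f}, std={std:.2f}, trend={trend}. "
        f"Forecast the next {horizon} steps."
    )


window = [1.0, 2.0, 3.0, 4.0]  # made-up values
print(build_stat_prompt(window, horizon=2))
```

The resulting string would be tokenized and passed to the language model alongside the embedded series; how the paper combines the two is described in the paper itself, not here.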

arXiv information

Authors	Haoran Fan, Bin Li, Yixuan Weng, Shoujun Zhou
Published	2025-03-05 15:27:36+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.CL

Towards Understanding Text Hallucination of Diffusion Models via Local Generation Bias

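Posted: March 6, 2025 | Author: jarxiv

Summary

Score-based diffusion models have achieved incredible performance in generating realistic image, audio, and video data.
These models produce high-quality samples with impressive detail, but they often introduce unrealistic artifacts such as distorted fingers or meaningless hallucinated text.
This paper focuses on text hallucination, in which diffusion models generate individual symbols correctly but assemble them in a nonsensical manner.
Through experimental probing, we consistently observe that this phenomenon is attributable to the network’s local generation bias.
Denoising networks tend to produce outputs that rely heavily on highly correlated local regions, particularly when different dimensions of the data distribution are nearly pairwise independent.
This behavior leads to a generation process that decomposes the global distribution into separate, independent distributions for each symbol and ultimately fails to capture global structure, including the underlying grammar.
Intriguingly, this bias persists across various denoising network architectures, including MLPs and transformers, which have the structure to model global dependencies.
These findings also offer insight into other types of hallucination beyond text that result from implicit biases in denoising models.
In addition, we theoretically analyze the training dynamics for a specific case involving a two-layer MLP learning parity points on a hypercube, offering an explanation of the underlying mechanism.

Summary (original)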

Score-based diffusion models have achieved incredible performance in generating realistic images, audio, and video data. While these models produce high-quality samples with impressive details, they often introduce unrealistic artifacts, such as distorted fingers or hallucinated texts with no meaning. This paper focuses on textual hallucinations, where diffusion models correctly generate individual symbols but assemble them in a nonsensical manner. Through experimental probing, we consistently observe that this phenomenon is attributable to the network’s local generation bias. Denoising networks tend to produce outputs that rely heavily on highly correlated local regions, particularly when different dimensions of the data distribution are nearly pairwise independent. This behavior leads to a generation process that decomposes the global distribution into separate, independent distributions for each symbol, ultimately failing to capture the global structure, including underlying grammar. Intriguingly, this bias persists across various denoising network architectures, including MLPs and transformers, which have the structure to model global dependency. These findings also provide insights into understanding other types of hallucinations, extending beyond text, as a result of implicit biases in the denoising models. Additionally, we theoretically analyze the training dynamics for a specific case involving a two-layer MLP learning parity points on a hypercube, offering an explanation of its underlying mechanism.
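A small enumeration shows why parity points are a natural test case for pairwise independence. The sketch assumes one particular reading of the setup, namely the uniform distribution over points of {-1, +1}^d whose coordinates multiply to +1. The paper’s exact construction may differ, and the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def parity_points(d):
    """All points of {-1, +1}^d whose coordinates multiply to +1."""
    points = []
    for p in product((-1, 1), repeat=d):
        prod = 1
        for v in p:
            prod *= v
        if prod == 1:
            points.append(p)
    return points


def pairwise_joint(points, i, j):
    """Joint law of coordinates i and j under the uniform law on `points`."""
    counts = {}
    for p in points:
        key = (p[i], p[j])
        counts[key] = counts.get(key, 0) + 1
    return {k: Fraction(c, len(points)) for k, c in counts.items()}


pts = parity_points(3)
print(len(pts))                   # number of even-parity points for d = 3
print(pairwise_joint(pts, 0, 1))  # joint law of the first two coordinates
```

For d = 3 every pair of coordinates takes each of its four sign combinations with probability 1/4, so the pair looks independent, yet the third coordinate is fully determined by the other two. A model that only captures local or pairwise statistics therefore misses the global constraint.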

arXiv information

Authors	Rui Lu, Runzhe Wang, Kaifeng Lyu, Xitai Jiang, Gao Huang, Mengdi Wang
Published	2025-03-05 15:28:50+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.LG

Decoupled Recommender Systems: Exploring Alternative Recommender Ecosystem Designs

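Posted: March 6, 2025 | Author: jarxiv

Summary

Recommender ecosystems are an emerging subject of research.
Such research examines how the characteristics of algorithms, recommendation consumers, and item providers influence system dynamics and long-term outcomes.
One architectural possibility not yet widely explored in this line of research is a configuration in which recommendation algorithms are decoupled from the platforms they serve.
This is sometimes called the ‘friendly neighborhood algorithm store’ or ‘middleware’ model.
We are particularly interested in how such architectures might offer a range of different distributions of utility across consumers, providers, and recommendation platforms.
In this paper, we create a model of a recommendation ecosystem that incorporates algorithm choice and examine the outcomes of such a design.

Summary (original)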

Recommender ecosystems are an emerging subject of research. Such research examines how the characteristics of algorithms, recommendation consumers, and item providers influence system dynamics and long-term outcomes. One architectural possibility that has not yet been widely explored in this line of research is the consequences of a configuration in which recommendation algorithms are decoupled from the platforms they serve. This is sometimes called ‘the friendly neighborhood algorithm store’ or ‘middleware’ model. We are particularly interested in how such architectures might offer a range of different distributions of utility across consumers, providers, and recommendation platforms. In this paper, we create a model of a recommendation ecosystem that incorporates algorithm choice and examine the outcomes of such a design.

arXiv information

Authors	Anas Buhayh, Elizabeth McKinnie, Robin Burke
Published	2025-03-05 15:42:37+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.HC, cs.IR

Measuring and identifying factors of individuals’ trust in Large Language Models

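Posted: March 6, 2025 | Author: jarxiv

Summary

Large language models (LLMs) can engage in human-looking conversational exchanges.
Although conversations can elicit trust between users and LLMs, little empirical research has examined trust formation in human-LLM contexts beyond the trustworthiness of LLMs or human trust in AI in general.
Here we introduce the Trust-In-LLMs Index (TILLMI), a new framework for measuring individuals’ trust in LLMs that extends McAllister’s cognitive and affective trust dimensions to LLM-human interactions.
We developed TILLMI as a psychometric scale and prototyped it with a novel protocol that we call LLM-simulated validity.
The LLM-based scale was then validated in a sample of 1,000 US respondents.
Exploratory factor analysis identified a two-factor structure.
Two items were then removed because of redundancy, yielding a final 6-item scale with a two-factor structure.
Confirmatory factor analysis on a separate subsample showed strong model fit ($CFI = .995$, $TLI = .991$, $RMSEA = .046$, $p_{X^2} > .05$).
Convergent validity analysis revealed that trust in LLMs correlated positively with openness to experience, extraversion, and cognitive flexibility, and negatively with neuroticism.
Based on these findings, we interpreted TILLMI’s factors as ‘closeness with LLMs’ (affective dimension) and ‘reliance on LLMs’ (cognitive dimension).
Younger males exhibited higher closeness with and reliance on LLMs than older women.
Individuals with no direct experience of LLMs exhibited lower levels of trust than LLM users.
These findings offer a novel empirical foundation for measuring trust in AI-driven verbal communication, informing responsible design, and fostering balanced human-AI collaboration.

Summary (original)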

Large Language Models (LLMs) can engage in human-looking conversational exchanges. Although conversations can elicit trust between users and LLMs, scarce empirical research has examined trust formation in human-LLM contexts, beyond LLMs’ trustworthiness or human trust in AI in general. Here, we introduce the Trust-In-LLMs Index (TILLMI) as a new framework to measure individuals’ trust in LLMs, extending McAllister’s cognitive and affective trust dimensions to LLM-human interactions. We developed TILLMI as a psychometric scale, prototyped with a novel protocol we called LLM-simulated validity. The LLM-based scale was then validated in a sample of 1,000 US respondents. Exploratory Factor Analysis identified a two-factor structure. Two items were then removed due to redundancy, yielding a final 6-item scale with a 2-factor structure. Confirmatory Factor Analysis on a separate subsample showed strong model fit ($CFI = .995$, $TLI = .991$, $RMSEA = .046$, $p_{X^2} > .05$). Convergent validity analysis revealed that trust in LLMs correlated positively with openness to experience, extraversion, and cognitive flexibility, but negatively with neuroticism. Based on these findings, we interpreted TILLMI’s factors as ‘closeness with LLMs’ (affective dimension) and ‘reliance on LLMs’ (cognitive dimension). Younger males exhibited higher closeness with and reliance on LLMs compared to older women. Individuals with no direct experience with LLMs exhibited lower levels of trust compared to LLM users. These findings offer a novel empirical foundation for measuring trust in AI-driven verbal communication, informing responsible design, and fostering balanced human-AI collaboration.

arXiv information

Authors	Edoardo Sebastiano De Duro, Giuseppe Alessandro Veltri, Hudson Golino, Massimo Stella
Published	2025-03-05 15:52:43+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.HC

One-Shot Imitation under Mismatched Execution

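Posted: March 6, 2025 | Author: jarxiv

Summary

Human demonstrations used as prompts are a powerful way to program robots to perform long-horizon manipulation tasks.
However, translating these demonstrations into robot-executable actions poses significant challenges because of execution mismatches in movement styles and physical capabilities.
Existing methods either depend on human-robot paired data, which is infeasible to scale, or rely heavily on frame-level visual similarities that often break down in practice.
To address these challenges, we propose RHyME, a novel framework that automatically aligns human and robot task executions using optimal transport costs.
Given long-horizon robot demonstrations, RHyME synthesizes semantically equivalent human videos by retrieving and composing short-horizon human clips.
This approach enables effective policy training without paired data.
RHyME successfully imitates a range of cross-embodiment demonstrators, both in simulation and with a real human hand, and increases task success by more than 50% compared with previous methods.
We release our code and datasets at https://portal-cornell.github.io/rhyme/.

Summary (original)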

Human demonstrations as prompts are a powerful way to program robots to do long-horizon manipulation tasks. However, translating these demonstrations into robot-executable actions presents significant challenges due to execution mismatches in movement styles and physical capabilities. Existing methods either depend on human-robot paired data, which is infeasible to scale, or rely heavily on frame-level visual similarities that often break down in practice. To address these challenges, we propose RHyME, a novel framework that automatically aligns human and robot task executions using optimal transport costs. Given long-horizon robot demonstrations, RHyME synthesizes semantically equivalent human videos by retrieving and composing short-horizon human clips. This approach facilitates effective policy training without the need for paired data. RHyME successfully imitates a range of cross-embodiment demonstrators, both in simulation and with a real human hand, achieving an increase of over 50% in task success compared to previous methods. We release our code and datasets at https://portal-cornell.github.io/rhyme/.
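To make “optimal transport cost” concrete, the sketch below computes an entropic-regularized transport plan with Sinkhorn iterations between two uniformly weighted sets of items. Sinkhorn is a common approximate solver chosen here for brevity; the abstract does not say which solver RHyME uses. The cost matrix values are made up and stand in for hypothetical distances between robot segments and human clips.

```python
import math


def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropic OT between uniform marginals; returns (plan, transport cost)."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform weight on each row item
    b = [1.0 / m] * m  # uniform weight on each column item
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return plan, total


# Hypothetical pairwise distances: rows are robot segments, columns human clips.
cost = [[0.0, 1.0], [1.0, 0.0]]
plan, total = sinkhorn(cost)
print(plan)
print(total)
```

A low total cost indicates that the two sets can be matched cheaply, which is the kind of signal a retrieval step could rank candidate clips by.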

arXiv information

Authors	Kushal Kedia, Prithwish Dan, Angela Chao, Maximus Adrian Pace, Sanjiban Choudhury
Published	2025-03-05 16:07:20+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.LG, cs.RO

MIRROR: A Novel Approach for the Automated Evaluation of Open-Ended Question Generation

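Posted: March 6, 2025 | Author: jarxiv

Summary

Automatic question generation is a critical task that involves evaluating question quality by considering factors such as engagement, pedagogical value, and the ability to stimulate critical thinking.
These aspects require human-like understanding and judgment, which automated systems currently lack.
However, human evaluation is costly and impractical for large samples of generated questions.
We therefore propose MIRROR (Multi-LLM Iterative Review and Response for Optimized Rating), a novel system that leverages large language models (LLMs) to automate the evaluation of questions produced by automated question generation systems.
We experimented with several state-of-the-art LLMs, including GPT-4, Gemini, and Llama2-70b.
We observed that scores on the human evaluation metrics, namely relevance, appropriateness, novelty, complexity, and grammaticality, improved with the feedback-based approach MIRROR and tended to move closer to the human baseline scores.
Furthermore, Pearson’s correlation coefficient between GPT-4 and human experts improved with the proposed feedback-based approach compared with direct prompting for evaluation.
Error analysis shows that MIRROR significantly helps to improve relevance and appropriateness.

Summary (original)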

Automatic question generation is a critical task that involves evaluating question quality by considering factors such as engagement, pedagogical value, and the ability to stimulate critical thinking. These aspects require human-like understanding and judgment, which automated systems currently lack. However, human evaluations are costly and impractical for large-scale samples of generated questions. Therefore, we propose a novel system, MIRROR (Multi-LLM Iterative Review and Response for Optimized Rating), which leverages large language models (LLMs) to automate the evaluation process for questions generated by automated question generation systems. We experimented with several state-of-the-art LLMs, such as GPT-4, Gemini, and Llama2-70b. We observed that the scores of human evaluation metrics, namely relevance, appropriateness, novelty, complexity, and grammaticality, improved when using the feedback-based approach called MIRROR, tending to be closer to the human baseline scores. Furthermore, we observed that Pearson’s correlation coefficient between GPT-4 and human experts improved when using our proposed feedback-based approach, MIRROR, compared to direct prompting for evaluation. Error analysis shows that our proposed approach, MIRROR, significantly helps to improve relevance and appropriateness.
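Agreement between model ratings and human ratings is reported above as Pearson’s correlation coefficient. A minimal sketch of that computation follows. The ratings are invented for illustration and are not taken from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical 1-5 relevance ratings for the same five generated questions.
human_scores = [4, 3, 5, 2, 4]
llm_scores = [5, 3, 4, 2, 4]
print(pearson_r(human_scores, llm_scores))
```

Computing this once for direct prompting and once for the feedback-based ratings gives the comparison the abstract describes.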

arXiv information

Authors	Aniket Deroy, Subhankar Maity, Sudeshna Sarkar
Published	2025-03-05 16:16:01+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.CL

Provable Benefits of Task-Specific Prompts for In-context Learning

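Posted: March 6, 2025 | Author: jarxiv

Summary

The in-context learning capabilities of modern language models have motivated a deeper mathematical understanding of sequence models.
A line of recent work has shown that linear attention models can emulate projected gradient descent iterations to implicitly learn the task vector from the data provided in the context window.
In this work, we consider a novel setting in which the global task distribution can be partitioned into a union of conditional task distributions.
We then examine the use of task-specific prompts and prediction heads for learning the prior information associated with the conditional task distribution using a one-layer attention model.
Our results on the loss landscape show that task-specific prompts facilitate a covariance-mean decoupling, in which prompt tuning explains the conditional mean of the distribution while the variance is learned and explained through in-context learning.
Incorporating a task-specific head further aids this process by entirely decoupling the estimation of the mean and variance components.
This covariance-mean perspective similarly explains how jointly training the prompt and attention weights can provably help over fine-tuning after pretraining.

Summary (original)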

The in-context learning capabilities of modern language models have motivated a deeper mathematical understanding of sequence models. A line of recent work has shown that linear attention models can emulate projected gradient descent iterations to implicitly learn the task vector from the data provided in the context window. In this work, we consider a novel setting where the global task distribution can be partitioned into a union of conditional task distributions. We then examine the use of task-specific prompts and prediction heads for learning the prior information associated with the conditional task distribution using a one-layer attention model. Our results on the loss landscape show that task-specific prompts facilitate a covariance-mean decoupling where prompt-tuning explains the conditional mean of the distribution whereas the variance is learned/explained through in-context learning. Incorporating a task-specific head further aids this process by entirely decoupling estimation of mean and variance components. This covariance-mean perspective similarly explains how jointly training the prompt and attention weights can provably help over fine-tuning after pretraining.
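The link between linear attention and gradient descent can be checked numerically in its simplest form. The sketch assumes plain, unprojected gradient descent on a least-squares loss starting from zero weights, which is a simplification of the projected variant mentioned above. The function names and data are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gd_one_step_prediction(xs, ys, x_query, lr):
    """Predict after one GD step from w = 0 on (1/2n) * sum (w.x_i - y_i)^2."""
    n, d = len(xs), len(x_query)
    # The gradient at w = 0 is -(1/n) * sum y_i x_i, so w1 = (lr/n) * sum y_i x_i.
    w = [lr / n * sum(y * x[k] for x, y in zip(xs, ys)) for k in range(d)]
    return dot(w, x_query)


def linear_attention_prediction(xs, ys, x_query, lr):
    """Softmax-free attention: keys x_i, values y_i, query x_query."""
    n = len(xs)
    return lr / n * sum(y * dot(x, x_query) for x, y in zip(xs, ys))


# Made-up in-context examples (x_i, y_i) and a query point.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [2.0, 3.0, 5.0]
x_query = [1.0, 2.0]
print(gd_one_step_prediction(xs, ys, x_query, lr=0.5))
print(linear_attention_prediction(xs, ys, x_query, lr=0.5))
```

Both functions compute the same quantity because the sum over examples and the inner product with the query commute, which is the algebraic identity behind the emulation result.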

arXiv information

Authors	Xiangyu Chang, Yingcong Li, Muti Kara, Samet Oymak, Amit K. Roy-Chowdhury
Published	2025-03-05 16:18:33+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.CL

AI Governance through Markets

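Posted: March 6, 2025 | Author: jarxiv

Summary

This paper argues that market governance mechanisms should be considered a key approach in the governance of artificial intelligence (AI), alongside traditional regulatory frameworks.
Current governance approaches have focused predominantly on regulation, but we contend that market-based mechanisms offer effective incentives for responsible AI development.
We examine four emerging vectors of market governance (insurance, auditing, procurement, and due diligence) and show how these mechanisms can affirm the relationship between AI risk and financial risk while addressing capital allocation inefficiencies.
We do not claim that market forces alone can adequately protect societal interests, but we maintain that standardised AI disclosures and market mechanisms can create powerful incentives for safe and responsible AI development.
This paper urges regulators, economists, and machine learning researchers to investigate and implement market-based approaches to AI governance.

Summary (original)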

This paper argues that market governance mechanisms should be considered a key approach in the governance of artificial intelligence (AI), alongside traditional regulatory frameworks. While current governance approaches have predominantly focused on regulation, we contend that market-based mechanisms offer effective incentives for responsible AI development. We examine four emerging vectors of market governance: insurance, auditing, procurement, and due diligence, demonstrating how these mechanisms can affirm the relationship between AI risk and financial risk while addressing capital allocation inefficiencies. While we do not claim that market forces alone can adequately protect societal interests, we maintain that standardised AI disclosures and market mechanisms can create powerful incentives for safe and responsible AI development. This paper urges regulators, economists, and machine learning researchers to investigate and implement market-based approaches to AI governance.

arXiv information

Authors	Philip Moreira Tomei, Rupal Jain, Matija Franklin
Published	2025-03-05 16:20:03+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, econ.GN, q-fin.EC

DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping

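Posted: March 6, 2025 | Author: jarxiv

Summary

Dexterous grasping remains a fundamental yet challenging problem in robotics.
A general-purpose robot must be able to grasp diverse objects in arbitrary scenarios.
However, existing research typically relies on specific assumptions, such as single-object settings or limited environments, which leads to constrained generalization.
Our solution is DexGraspVLA, a hierarchical framework that uses a pre-trained vision-language model as the high-level task planner and learns a diffusion-based policy as the low-level action controller.
The key insight lies in iteratively transforming diverse language and visual inputs into domain-invariant representations, on which imitation learning can be applied effectively because domain shift is alleviated.
It thus enables robust generalization across a wide range of real-world scenarios.
Notably, our method achieves a success rate above 90% under thousands of unseen object, lighting, and background combinations in a “zero-shot” environment.
Empirical analysis further confirms the consistency of internal model behavior across environmental variations, thereby validating the design and explaining its generalization performance.
We hope our work can be a step forward in achieving general dexterous grasping.
The demo and code can be found at https://dexgraspvla.github.io/.

Summary (original)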

Dexterous grasping remains a fundamental yet challenging problem in robotics. A general-purpose robot must be capable of grasping diverse objects in arbitrary scenarios. However, existing research typically relies on specific assumptions, such as single-object settings or limited environments, leading to constrained generalization. Our solution is DexGraspVLA, a hierarchical framework that utilizes a pre-trained Vision-Language model as the high-level task planner and learns a diffusion-based policy as the low-level Action controller. The key insight lies in iteratively transforming diverse language and visual inputs into domain-invariant representations, where imitation learning can be effectively applied due to the alleviation of domain shift. Thus, it enables robust generalization across a wide range of real-world scenarios. Notably, our method achieves a 90+% success rate under thousands of unseen object, lighting, and background combinations in a “zero-shot” environment. Empirical analysis further confirms the consistency of internal model behavior across environmental variations, thereby validating our design and explaining its generalization performance. We hope our work can be a step forward in achieving general dexterous grasping. Our demo and code can be found at https://dexgraspvla.github.io/.

arXiv information

Authors	Yifan Zhong, Xuchuan Huang, Ruochong Li, Ceyao Zhang, Yitao Liang, Yaodong Yang, Yuanpei Chen
Published	2025-03-05 16:23:09+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.RO

Improving Neutral Point of View Text Generation through Parameter-Efficient Reinforcement Learning and a Small-Scale High-Quality Dataset

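Posted: March 6, 2025 | Author: jarxiv

Summary

This paper describes the construction of a dataset and the evaluation of training methods to improve the ability of generative large language models (LLMs) to answer queries on sensitive topics with a Neutral Point of View (NPOV), i.e., to provide significantly more informative, diverse, and impartial answers.
The dataset, SHQ-NPOV, comprises 300 high-quality, human-written quadruplets: a query on a sensitive topic, an answer, an NPOV rating, and a set of links to source texts elaborating the various points of view.
The first key contribution of this paper is a new methodology for creating such datasets through iterative rounds of human peer critique and annotator training, which we release alongside the dataset.
The second key contribution is the identification of a highly effective training regime for parameter-efficient reinforcement learning (PE-RL) to improve NPOV generation.
We compare and extensively evaluate PE-RL against multiple baselines, including LoRA finetuning (a strong baseline), SFT, and RLHF.
PE-RL not only improves overall NPOV quality compared with the strongest baseline ($97.06\% \rightarrow 99.08\%$), but also scores much higher on features that linguists identify as key to separating good answers from the best answers ($60.25\% \rightarrow 85.21\%$ for presence of supportive details, $68.74\% \rightarrow 91.43\%$ for absence of oversimplification).
A qualitative analysis corroborates this.
Finally, our evaluation finds no statistical difference between results on topics that appear in the training dataset and results on separate evaluation topics, which provides strong evidence that our approach to training PE-RL generalizes very effectively out of topic.

Summary (original)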

This paper describes the construction of a dataset and the evaluation of training methods to improve generative large language models’ (LLMs) ability to answer queries on sensitive topics with a Neutral Point of View (NPOV), i.e., to provide significantly more informative, diverse and impartial answers. The dataset, the SHQ-NPOV dataset, comprises 300 high-quality, human-written quadruplets: a query on a sensitive topic, an answer, an NPOV rating, and a set of links to source texts elaborating the various points of view. The first key contribution of this paper is a new methodology to create such datasets through iterative rounds of human peer-critique and annotator training, which we release alongside the dataset. The second key contribution is the identification of a highly effective training regime for parameter-efficient reinforcement learning (PE-RL) to improve NPOV generation. We compare and extensively evaluate PE-RL and multiple baselines, including LoRA finetuning (a strong baseline), SFT and RLHF. PE-RL not only improves on overall NPOV quality compared to the strongest baseline ($97.06\%\rightarrow 99.08\%$), but also scores much higher on features linguists identify as key to separating good answers from the best answers ($60.25\%\rightarrow 85.21\%$ for presence of supportive details, $68.74\%\rightarrow 91.43\%$ for absence of oversimplification). A qualitative analysis corroborates this. Finally, our evaluation finds no statistical differences between results on topics that appear in the training dataset and those on separated evaluation topics, which provides strong evidence that our approach to training PE-RL exhibits very effective out-of-topic generalization.
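The three reported before-and-after scores can be restated as percentage-point gains with a few lines of arithmetic. The figures are copied from the abstract above. The dictionary layout and the helper `point_gain` are illustrative only.

```python
# (baseline %, PE-RL %) as reported in the abstract.
reported = {
    "overall NPOV quality": (97.06, 99.08),
    "presence of supportive details": (60.25, 85.21),
    "absence of oversimplification": (68.74, 91.43),
}


def point_gain(before, after):
    """Absolute gain in percentage points, rounded to two decimals."""
    return round(after - before, 2)


for name, (before, after) in reported.items():
    print(f"{name}: +{point_gain(before, after):.2f} points")
```

The overall quality score has little headroom above the baseline, so the larger gains on the two finer-grained features are the more informative part of the comparison.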

arXiv information

Authors	Jessica Hoffmann, Christiane Ahlheim, Zac Yu, Aria Walfrand, Jarvis Jin, Marie Tano, Ahmad Beirami, Erin van Liemt, Nithum Thain, Hakim Sidahmed, Lucas Dixon
Published	2025-03-05 16:32:47+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.CL, cs.LG
