jarxiv | Japanese arxiv | Page 458

Safety Evaluation and Enhancement of DeepSeek Models in Chinese Contexts

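Posted: May 19, 2025 | Author: jarxiv

Summary

DeepSeek-R1, renowned for its exceptional reasoning capabilities and open-source strategy, is having a significant impact on the global artificial intelligence landscape.
However, it exhibits notable safety shortcomings.
Recent research by Robust Intelligence, a subsidiary of Cisco, in collaboration with the University of Pennsylvania found that DeepSeek-R1 exhibits a 100% attack success rate when processing harmful prompts.
Furthermore, multiple security firms and research institutions have identified critical security vulnerabilities in the model.
Although China Unicom has uncovered safety vulnerabilities of R1 in Chinese contexts, the safety capabilities of the remaining distilled models in the R1 series have not yet been comprehensively evaluated.
To address this gap, this study uses the comprehensive Chinese safety benchmark CHiSafetyBench to conduct an in-depth safety evaluation of the DeepSeek-R1 series distilled models.
The objective is to assess the safety capabilities of these models in Chinese contexts both before and after distillation, and to further elucidate the adverse effects of distillation on model safety.
Building on these findings, we implement targeted safety enhancements for the entire DeepSeek-R1 model series.
Evaluation results indicate that the enhanced models achieve significant safety improvements while maintaining reasoning capabilities without notable degradation.
We open-source the safety-enhanced models at https://github.com/UnicomAI/DeepSeek-R1-Safe to serve as a valuable resource for future research and optimization of DeepSeek models.

Summary (original)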

DeepSeek-R1, renowned for its exceptional reasoning capabilities and open-source strategy, is significantly influencing the global artificial intelligence landscape. However, it exhibits notable safety shortcomings. Recent research conducted by Robust Intelligence, a subsidiary of Cisco, in collaboration with the University of Pennsylvania, revealed that DeepSeek-R1 achieves a 100% attack success rate when processing harmful prompts. Furthermore, multiple security firms and research institutions have identified critical security vulnerabilities within the model. Although China Unicom has uncovered safety vulnerabilities of R1 in Chinese contexts, the safety capabilities of the remaining distilled models in the R1 series have not yet been comprehensively evaluated. To address this gap, this study utilizes the comprehensive Chinese safety benchmark CHiSafetyBench to conduct an in-depth safety evaluation of the DeepSeek-R1 series distilled models. The objective is to assess the safety capabilities of these models in Chinese contexts both before and after distillation, and to further elucidate the adverse effects of distillation on model safety. Building on these findings, we implement targeted safety enhancements for the entire DeepSeek-R1 model series. Evaluation results indicate that the enhanced models achieve significant improvements in safety while maintaining reasoning capabilities without notable degradation. We open-source the safety-enhanced models at https://github.com/UnicomAI/DeepSeek-R1-Safe to serve as a valuable resource for future research and optimization of DeepSeek models.

arXiv information

Authors	Wenjing Zhang,Xuejiao Lei,Zhaoxiang Liu,Limin Han,Jiaojiao Zhao,Junting Guo,Zhenhong Long,Shu Yang,Meijuan An,Beibei Huang,Rongjia Du,Ning Wang,Kai Wang,Shiguo Lian
Published	2025-05-16 13:29:19+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.CL, cs.CY

ZeroSearch: Incentivize the Search Capability of LLMs without Searching

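Posted: May 19, 2025 | Author: jarxiv

Summary

Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs).
Recent work has used reinforcement learning (RL) to improve LLMs' search capabilities by having them interact with live search engines in real-world environments.
Although these approaches show promising results, they face two major challenges. (1) Uncontrolled document quality: the quality of documents returned by search engines is often unpredictable, which introduces noise and instability into the training process.
(2) Prohibitively high API costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely limit scalability.
To address these challenges, we introduce ZeroSearch, a new RL framework that incentivizes LLMs' ability to use a real search engine by training with simulated searches.
Our approach begins with lightweight supervised fine-tuning that turns an LLM into a retrieval module able to generate both useful and noisy documents in response to a query.
During RL training, we employ a curriculum-based rollout strategy that gradually degrades the quality of the generated documents, progressively eliciting the model's reasoning ability by exposing it to increasingly challenging retrieval scenarios.
Extensive experiments show that ZeroSearch effectively incentivizes LLMs' search capabilities using a 3B LLM as the retrieval module.
Remarkably, a 7B retrieval module achieves performance comparable to the real search engine, and a 14B retrieval module even surpasses it.
Furthermore, it generalizes well across both base and instruction-tuned models of various parameter sizes and is compatible with a wide range of RL algorithms.

Summary (original)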

Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs’ search capabilities by interacting with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled Document Quality: The quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively High API Costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a novel RL framework that incentivizes the capabilities of LLMs to use a real search engine with simulated searches during training. Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both useful and noisy documents in response to a query. During RL training, we employ a curriculum-based rollout strategy that incrementally degrades the quality of generated documents, progressively eliciting the model’s reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves comparable performance to the real search engine, while a 14B retrieval module even surpasses it. Furthermore, it generalizes well across both base and instruction-tuned models of various parameter sizes and is compatible with a wide range of RL algorithms.
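The curriculum-based rollout described in the abstract can be pictured with a small sketch. The Python below is only an illustration under stated assumptions: the linear schedule, the names `noise_probability` and `pick_document_kind`, and the default probabilities are hypothetical and are not taken from the ZeroSearch paper, whose actual schedule may differ.

```python
import random


def noise_probability(step, total_steps, p_start=0.0, p_end=0.5):
    """Chance of serving a noisy document at a given training step.

    Assumption: a simple linear ramp from p_start to p_end. The paper
    only says that document quality is degraded incrementally.
    """
    if total_steps <= 1:
        return p_end
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


def pick_document_kind(step, total_steps, rng):
    """Decide whether the simulated retriever should emit a useful or noisy document."""
    if rng.random() < noise_probability(step, total_steps):
        return "noisy"
    return "useful"


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 50, 100):
        print(step, noise_probability(step, 101), pick_document_kind(step, 101, rng))
```

Early rollouts see mostly useful documents and later rollouts see progressively more noise, which is the general shape of a curriculum; any monotone schedule could be substituted for the linear ramp.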

arXiv information

Authors	Hao Sun,Zile Qiao,Jiayan Guo,Xuanbo Fan,Yingyan Hou,Yong Jiang,Pengjun Xie,Yan Zhang,Fei Huang,Jingren Zhou
Published	2025-05-16 13:53:00+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CL

Semantic Caching of Contextual Summaries for Efficient Question-Answering with Language Models

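Posted: May 19, 2025 | Author: jarxiv

Summary

Large language models (LLMs) are increasingly deployed across edge and cloud platforms for real-time question answering and retrieval-augmented generation.
However, processing long contexts in distributed systems incurs high computational overhead, memory usage, and network bandwidth.
This paper introduces a new semantic caching approach for storing and reusing intermediate contextual summaries, enabling efficient information reuse across similar queries in LLM-based QA workflows.
Our method reduces redundant computation by up to 50-60% while maintaining answer accuracy comparable to full document processing, as demonstrated on NaturalQuestions, TriviaQA, and a synthetic ArXiv dataset.
This approach balances computational cost and response quality, which is critical for real-time AI assistants.

Summary (original)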

Large Language Models (LLMs) are increasingly deployed across edge and cloud platforms for real-time question-answering and retrieval-augmented generation. However, processing lengthy contexts in distributed systems incurs high computational overhead, memory usage, and network bandwidth. This paper introduces a novel semantic caching approach for storing and reusing intermediate contextual summaries, enabling efficient information reuse across similar queries in LLM-based QA workflows. Our method reduces redundant computations by up to 50-60% while maintaining answer accuracy comparable to full document processing, as demonstrated on NaturalQuestions, TriviaQA, and a synthetic ArXiv dataset. This approach balances computational cost and response quality, critical for real-time AI assistants.
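The general idea of reusing a cached summary for a semantically similar query can be sketched as follows. This is a minimal illustration, not the authors' system: the bag-of-words `embed` function stands in for a real sentence encoder, and the `SummaryCache` class, the `answer_context` helper, and the 0.8 similarity threshold are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SummaryCache:
    """Stores (query vector, summary) pairs and serves the closest match."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed cut-off for "similar enough"
        self.entries = []

    def lookup(self, query):
        query_vec = embed(query)
        best_score, best_summary = 0.0, None
        for vec, summary in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_summary = score, summary
        return best_summary if best_score >= self.threshold else None

    def store(self, query, summary):
        self.entries.append((embed(query), summary))


def answer_context(query, cache, summarize):
    """Return (summary, cache_hit); call the costly summarizer only on a miss."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached, True
    summary = summarize(query)
    cache.store(query, summary)
    return summary, False
```

A caller would pass its own `summarize` function (for example, one that runs an LLM over the retrieved documents); on a hit, that call is skipped, which is where the savings in redundant computation would come from. The threshold trades reuse rate against the risk of serving a summary that does not fit the new query.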

arXiv information

Authors	Camille Couturier,Spyros Mastorakis,Haiying Shen,Saravan Rajmohan,Victor Rühle
Published	2025-05-16 14:04:31+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.AI, cs.CL, cs.IR, cs.LG, I.2.7

SelfBudgeter: Adaptive Token Allocation for Efficient LLM Reasoning

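Posted: May 19, 2025 | Author: jarxiv

Summary

Large reasoning models have recently demonstrated exceptional performance on a variety of tasks.
However, they inefficiently over-process both trivial and complex queries, which wastes resources and prolongs user latency.
To address this challenge, we propose SelfBudgeter, a self-adaptive, controllable reasoning strategy for efficient reasoning.
Our approach adopts a dual-phase training paradigm: first, the model learns to pre-estimate the reasoning cost based on the difficulty of the query.
Then, we introduce budget-guided GPRO for reinforcement learning, which effectively maintains accuracy while reducing output length.
SelfBudgeter lets users anticipate generation time and make informed decisions about continuing or interrupting the process.
Furthermore, our method enables direct manipulation of reasoning length by pre-filling the token budget.
Experimental results show that SelfBudgeter can rationally allocate budgets according to problem complexity, achieving up to 74.47% response length compression on the MATH benchmark while maintaining nearly undiminished accuracy.

Summary (original)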

Recently, large reasoning models demonstrate exceptional performance on various tasks. However, reasoning models inefficiently over-process both trivial and complex queries, leading to resource waste and prolonged user latency. To address this challenge, we propose SelfBudgeter – a self-adaptive controllable reasoning strategy for efficient reasoning. Our approach adopts a dual-phase training paradigm: first, the model learns to pre-estimate the reasoning cost based on the difficulty of the query. Then, we introduce budget-guided GPRO for reinforcement learning, which effectively maintains accuracy while reducing output length. SelfBudgeter allows users to anticipate generation time and make informed decisions about continuing or interrupting the process. Furthermore, our method enables direct manipulation of reasoning length via pre-filling token budget. Experimental results demonstrate that SelfBudgeter can rationally allocate budgets according to problem complexity, achieving up to 74.47% response length compression on the MATH benchmark while maintaining nearly undiminished accuracy.

arXiv information

Authors	Zheng Li,Qingxiu Dong,Jingyuan Ma,Di Zhang,Zhifang Sui
Published	2025-05-16 14:08:04+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.AI, cs.CL

Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs

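Posted: May 19, 2025 | Author: jarxiv

Summary

Large language models have demonstrated impressive reasoning capabilities but are inherently limited by their knowledge reservoir.
Retrieval-augmented reasoning mitigates this limitation by allowing LLMs to query external resources, but existing methods often retrieve irrelevant or noisy information, which hinders accurate reasoning.
In this paper, we propose AutoRefine, a reinforcement learning post-training framework that adopts a new "search-and-refine-during-think" paradigm.
AutoRefine introduces explicit knowledge refinement steps between successive search calls, enabling the model to iteratively filter, distill, and organize evidence before generating an answer.
Furthermore, we incorporate tailored retrieval-specific rewards alongside answer correctness rewards using group relative policy optimization.
Experiments on single-hop and multi-hop QA benchmarks show that AutoRefine significantly outperforms existing approaches, particularly in complex multi-hop reasoning scenarios.
Detailed analysis shows that AutoRefine issues frequent, higher-quality searches and synthesizes evidence effectively.

Summary (original)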

Large language models have demonstrated impressive reasoning capabilities but are inherently limited by their knowledge reservoir. Retrieval-augmented reasoning mitigates this limitation by allowing LLMs to query external resources, but existing methods often retrieve irrelevant or noisy information, hindering accurate reasoning. In this paper, we propose AutoRefine, a reinforcement learning post-training framework that adopts a new “search-and-refine-during-think” paradigm. AutoRefine introduces explicit knowledge refinement steps between successive search calls, enabling the model to iteratively filter, distill, and organize evidence before generating an answer. Furthermore, we incorporate tailored retrieval-specific rewards alongside answer correctness rewards using group relative policy optimization. Experiments on single-hop and multi-hop QA benchmarks demonstrate that AutoRefine significantly outperforms existing approaches, particularly in complex, multi-hop reasoning scenarios. Detailed analysis shows that AutoRefine issues frequent, higher-quality searches and synthesizes evidence effectively.

arXiv information

Authors	Yaorui Shi,Shihan Li,Chang Wu,Zhiyuan Liu,Junfeng Fang,Hengxing Cai,An Zhang,Xiang Wang
Published	2025-05-16 14:11:29+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.AI, cs.CL

Thousand Voices of Trauma: A Large-Scale Synthetic Dataset for Modeling Prolonged Exposure Therapy Conversations

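Posted: May 19, 2025 | Author: jarxiv

Summary

Progress on AI systems for mental health support is hindered by limited access to therapeutic conversation data, particularly for trauma treatment.
We present Thousand Voices of Trauma, a synthetic benchmark dataset of 3,000 therapy conversations based on Prolonged Exposure therapy protocols for post-traumatic stress disorder (PTSD).
The dataset comprises 500 unique cases, each explored through six conversational perspectives that mirror the progression of therapy from initial anxiety to peak distress to emotional processing.
We incorporated diverse demographic profiles (ages 18-80, M = 49.3, 49.4% male, 44.4% female, 6.2% non-binary), 20 trauma types, and 10 trauma-related behaviors using deterministic and probabilistic generation methods.
Analysis reveals realistic distributions of trauma types (witnessing violence 10.6%, bullying 10.2%) and symptoms (nightmares 23.4%, substance abuse 20.8%).
Clinical experts validated the dataset's therapeutic fidelity, highlighting its emotional depth while suggesting refinements for greater authenticity.
We also developed an emotional trajectory benchmark with standardized metrics for evaluating model responses.
This privacy-preserving dataset addresses critical gaps in trauma-focused mental health data, offering a valuable resource for advancing both patient-facing applications and clinician training tools.

Summary (original)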

The advancement of AI systems for mental health support is hindered by limited access to therapeutic conversation data, particularly for trauma treatment. We present Thousand Voices of Trauma, a synthetic benchmark dataset of 3,000 therapy conversations based on Prolonged Exposure therapy protocols for Post-traumatic Stress Disorder (PTSD). The dataset comprises 500 unique cases, each explored through six conversational perspectives that mirror the progression of therapy from initial anxiety to peak distress to emotional processing. We incorporated diverse demographic profiles (ages 18-80, M=49.3, 49.4% male, 44.4% female, 6.2% non-binary), 20 trauma types, and 10 trauma-related behaviors using deterministic and probabilistic generation methods. Analysis reveals realistic distributions of trauma types (witnessing violence 10.6%, bullying 10.2%) and symptoms (nightmares 23.4%, substance abuse 20.8%). Clinical experts validated the dataset’s therapeutic fidelity, highlighting its emotional depth while suggesting refinements for greater authenticity. We also developed an emotional trajectory benchmark with standardized metrics for evaluating model responses. This privacy-preserving dataset addresses critical gaps in trauma-focused mental health data, offering a valuable resource for advancing both patient-facing applications and clinician training tools.

arXiv information

Authors	Suhas BN,Andrew M. Sherrill,Rosa I. Arriaga,Chris W. Wiese,Saeed Abdullah
Published	2025-05-16 14:12:03+00:00
arXiv site	arxiv_id(pdf)

Categories: 68T50, cs.AI, cs.CL, cs.CY, cs.HC, cs.LG, H.5.2

XRAG: eXamining the Core — Benchmarking Foundational Components in Advanced Retrieval-Augmented Generation

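Posted: May 19, 2025 | Author: jarxiv

Summary

Retrieval-augmented generation (RAG) combines the retrieval of relevant data with the generative capabilities of large language models (LLMs), ensuring that the generated output is not only contextually relevant but also accurate and current.
We introduce XRAG, an open-source, modular codebase that facilitates exhaustive evaluation of the foundational components of advanced RAG modules.
These components are systematically categorized into four core phases: pre-retrieval, retrieval, post-retrieval, and generation.
We systematically analyze them across reconfigured datasets, providing a comprehensive benchmark of their effectiveness.
As RAG systems continue to grow in complexity, we underscore the critical need to identify their potential failure points.
We formulate a suite of experimental methodologies and diagnostic testing protocols to dissect the failure points inherent in RAG engineering.
We then offer tailored solutions aimed at improving the overall performance of these modules.
Our work thoroughly evaluates the performance of advanced core components in RAG systems and provides insights into optimizations for prevalent failure points.

Summary (original)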

Retrieval-augmented generation (RAG) synergizes the retrieval of pertinent data with the generative capabilities of Large Language Models (LLMs), ensuring that the generated output is not only contextually relevant but also accurate and current. We introduce XRAG, an open-source, modular codebase that facilitates exhaustive evaluation of the performance of foundational components of advanced RAG modules. These components are systematically categorized into four core phases: pre-retrieval, retrieval, post-retrieval, and generation. We systematically analyse them across reconfigured datasets, providing a comprehensive benchmark for their effectiveness. As the complexity of RAG systems continues to escalate, we underscore the critical need to identify potential failure points in RAG systems. We formulate a suite of experimental methodologies and diagnostic testing protocols to dissect the failure points inherent in RAG engineering. Subsequently, we proffer bespoke solutions aimed at bolstering the overall performance of these modules. Our work thoroughly evaluates the performance of advanced core components in RAG systems, providing insights into optimizations for prevalent failure points.

arXiv information

Authors	Qianren Mao,Yangyifei Luo,Qili Zhang,Yashuo Luo,Zhilong Cao,Jinlong Zhang,HanWen Hao,Zhijun Chen,Weifeng Jiang,Junnan Liu,Xiaolong Wang,Zhenting Huang,Zhixing Tan,Sun Jie,Bo Li,Xudong Liu,Richong Zhang,Jianxin Li
Published	2025-05-16 14:13:36+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.AI, cs.CL

Temporal fine-tuning for early risk detection

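Posted: May 19, 2025 | Author: jarxiv

Summary

Early Risk Detection (ERD) on the Web aims to promptly identify users facing social and health issues.
Users are analyzed post by post, and answers must be both correct and quick, which is particularly challenging in critical scenarios.
ERD involves optimizing classification precision while minimizing detection delay.
Standard classification metrics may not suffice, so specific metrics such as ERDE(theta), which explicitly consider precision and delay, are used.
Current research focuses on applying a multi-objective approach that prioritizes classification performance and establishes a separate criterion for decision time.
In this work, we propose a completely different strategy, temporal fine-tuning, which allows tuning transformer-based models by explicitly incorporating time within the learning process.
Our method allows us to analyze complete user post histories, tune models considering different contexts, and evaluate training performance using temporal metrics.
We evaluated our proposal on the depression and eating disorder tasks for the Spanish language, achieving competitive results compared with the best models of MentalRiskES 2023.
We found that temporal fine-tuning optimized decisions considering context and time progress.
In this way, by properly taking advantage of the power of transformers, it is possible to address ERD by combining precision and speed as a single objective.

Summary (original)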

Early Risk Detection (ERD) on the Web aims to identify promptly users facing social and health issues. Users are analyzed post-by-post, and it is necessary to guarantee correct and quick answers, which is particularly challenging in critical scenarios. ERD involves optimizing classification precision and minimizing detection delay. Standard classification metrics may not suffice, resorting to specific metrics such as ERDE(theta) that explicitly consider precision and delay. The current research focuses on applying a multi-objective approach, prioritizing classification performance and establishing a separate criterion for decision time. In this work, we propose a completely different strategy, temporal fine-tuning, which allows tuning transformer-based models by explicitly incorporating time within the learning process. Our method allows us to analyze complete user post histories, tune models considering different contexts, and evaluate training performance using temporal metrics. We evaluated our proposal in the depression and eating disorders tasks for the Spanish language, achieving competitive results compared to the best models of MentalRiskES 2023. We found that temporal fine-tuning optimized decisions considering context and time progress. In this way, by properly taking advantage of the power of transformers, it is possible to address ERD by combining precision and speed as a single objective.
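For readers unfamiliar with ERDE, the sketch below shows how a delay-aware error of this kind is commonly computed for a single user. It follows the ERDE definition as it is usually stated in the early risk detection literature (a sigmoid latency penalty applied to true positives), which is an assumption here since the abstract does not spell out the formula. The function names, the parameter `o` (the threshold the abstract calls theta), and the default cost values are illustrative.

```python
import math


def latency_cost(k, o):
    """Penalty in [0, 1) that grows with the number of posts k seen before deciding.

    o is the delay threshold: the penalty is 0.5 when k == o.
    """
    if k - o > 700:  # avoid overflow in math.exp for very late decisions
        return 1.0
    return 1.0 - 1.0 / (1.0 + math.exp(k - o))


def erde(prediction, truth, k, o, c_fp, c_fn=1.0, c_tp=1.0):
    """Delay-aware error for one user.

    prediction, truth: booleans (True means "at risk").
    k: number of posts read before the decision was issued.
    c_fp, c_fn, c_tp: costs; the defaults for c_fn and c_tp are assumptions.
    """
    if prediction and not truth:
        return c_fp                          # false positive
    if not prediction and truth:
        return c_fn                          # false negative
    if prediction and truth:
        return latency_cost(k, o) * c_tp     # true positive, penalized by delay
    return 0.0                               # true negative


if __name__ == "__main__":
    # A correct alarm raised early costs far less than the same alarm raised late.
    print(erde(True, True, k=2, o=5, c_fp=0.1))
    print(erde(True, True, k=20, o=5, c_fp=0.1))
```

Averaging this value over all users gives a single score in which lower is better, and a correct but late detection is penalized almost as heavily as a miss.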

arXiv information

Authors	Horacio Thompson,Esaú Villatoro-Tello,Manuel Montes-y-Gómez,Marcelo Errecalde
Published	2025-05-16 14:17:03+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CL

Probing Subphonemes in Morphology Models

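Posted: May 19, 2025 | Author: jarxiv

Summary

Transformers have achieved state-of-the-art performance in morphological inflection tasks, yet their ability to generalize across languages and morphological rules remains limited.
One possible explanation for this behavior is the degree to which these models can capture implicit phenomena at the phonological and subphonemic levels.
We introduce a language-agnostic probing method to investigate phonological feature encoding in transformers trained directly on phonemes, and apply it across seven morphologically diverse languages.
We show that local phonological features, such as final-obstruent devoicing in Turkish, are captured well in phoneme embeddings, whereas long-distance dependencies such as vowel harmony are better represented in the transformer's encoder.
Finally, we discuss how these findings inform empirical strategies for training morphological models, particularly regarding the role of subphonemic feature acquisition.

Summary (original)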

Transformers have achieved state-of-the-art performance in morphological inflection tasks, yet their ability to generalize across languages and morphological rules remains limited. One possible explanation for this behavior can be the degree to which these models are able to capture implicit phenomena at the phonological and subphonemic levels. We introduce a language-agnostic probing method to investigate phonological feature encoding in transformers trained directly on phonemes, and perform it across seven morphologically diverse languages. We show that phonological features which are local, such as final-obstruent devoicing in Turkish, are captured well in phoneme embeddings, whereas long-distance dependencies like vowel harmony are better represented in the transformer’s encoder. Finally, we discuss how these findings inform empirical strategies for training morphological models, particularly regarding the role of subphonemic feature acquisition.

arXiv information

Authors	Gal Astrach,Yuval Pinter
Published	2025-05-16 14:27:40+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CL

XtraGPT: LLMs for Human-AI Collaboration on Controllable Academic Paper Revision

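Posted: May 19, 2025 | Author: jarxiv

Summary

Despite the growing adoption of large language models (LLMs) in academic workflows, their capabilities remain limited when it comes to supporting high-quality scientific writing.
Most existing systems are designed for general-purpose scientific text generation and fail to meet the sophisticated demands of research communication beyond surface-level polishing, such as conceptual coherence across sections.
Furthermore, academic writing is inherently iterative and revision-driven, a process not well supported by direct prompting-based paradigms.
To address these scenarios, we propose a human-AI collaboration framework for academic paper revision.
We first introduce a comprehensive dataset of 7,040 research papers from top-tier venues, annotated with over 140,000 instruction-response pairs that reflect realistic, section-level scientific revisions.
Building on the dataset, we develop XtraGPT, the first suite of open-source LLMs designed to provide context-aware, instruction-guided writing assistance, ranging from 1.5B to 14B parameters.
Extensive experiments validate that XtraGPT significantly outperforms same-scale baselines and approaches the quality of proprietary systems.
Both automated preference assessments and human evaluations confirm the effectiveness of our models in improving scientific drafts.

Summary (original)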

Despite the growing adoption of large language models (LLMs) in academic workflows, their capabilities remain limited when it comes to supporting high-quality scientific writing. Most existing systems are designed for general-purpose scientific text generation and fail to meet the sophisticated demands of research communication beyond surface-level polishing, such as conceptual coherence across sections. Furthermore, academic writing is inherently iterative and revision-driven, a process not well supported by direct prompting-based paradigms. To address these scenarios, we propose a human-AI collaboration framework for academic paper revision. We first introduce a comprehensive dataset of 7,040 research papers from top-tier venues annotated with over 140,000 instruction-response pairs that reflect realistic, section-level scientific revisions. Building on the dataset, we develop XtraGPT, the first suite of open-source LLMs, designed to provide context-aware, instruction-guided writing assistance, ranging from 1.5B to 14B parameters. Extensive experiments validate that XtraGPT significantly outperforms same-scale baselines and approaches the quality of proprietary systems. Both automated preference assessments and human evaluations confirm the effectiveness of our models in improving scientific drafts.

arXiv information

Authors	Nuo Chen,Andre Lin HuiKai,Jiaying Wu,Junyi Hou,Zining Zhang,Qian Wang,Xidong Wang,Bingsheng He
Published	2025-05-16 15:02:19+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CL
