jarxiv | Japanese arxiv | Page 102

Provable Benefits of Unsupervised Pre-training and Transfer Learning via Single-Index Models

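Posted: June 12, 2025 | Author: jarxiv

Summary

Unsupervised pre-training and transfer learning are commonly used to initialize training algorithms for neural networks, especially in settings with limited labeled data.
In this paper, we study the effect of unsupervised pre-training and transfer learning on the sample complexity of high-dimensional supervised learning.
Specifically, we consider the problem of training a single-layer neural network via online stochastic gradient descent.
We establish that pre-training and transfer learning (under concept shift) reduce sample complexity by polynomial factors (in the dimension) under very general assumptions.
We also uncover some surprising settings where pre-training yields an exponential improvement over random initialization in terms of sample complexity.

Summary (original)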

Unsupervised pre-training and transfer learning are commonly used techniques to initialize training algorithms for neural networks, particularly in settings with limited labeled data. In this paper, we study the effects of unsupervised pre-training and transfer learning on the sample complexity of high-dimensional supervised learning. Specifically, we consider the problem of training a single-layer neural network via online stochastic gradient descent. We establish that pre-training and transfer learning (under concept shift) reduce sample complexity by polynomial factors (in the dimension) under very general assumptions. We also uncover some surprising settings where pre-training grants exponential improvement over random initialization in terms of sample complexity.
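
The setting in this abstract can be illustrated with a minimal sketch of online SGD on a single-index model, where each sample is drawn fresh and used once. Everything specific here is an assumption made for illustration and is not taken from the paper: the `tanh` link function, the squared loss, the projection back onto the unit sphere, the step size, and the way the "pre-trained" start `w_warm` is simulated as the true direction plus noise.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def online_sgd(w0, theta, n_samples, lr, rng):
    """One pass of online SGD on the unit sphere; each sample is used exactly once."""
    w = normalize(w0)
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in theta]  # fresh Gaussian input
        y = math.tanh(dot(theta, x))              # hypothetical link function
        pred = math.tanh(dot(w, x))
        # Gradient of 0.5 * (y - tanh(w.x))^2 with respect to w is scale * x.
        scale = -(y - pred) * (1.0 - pred * pred)
        w = normalize([wi - lr * scale * xi for wi, xi in zip(w, x)])
    return w

rng = random.Random(0)
d = 50
theta = normalize([rng.gauss(0.0, 1.0) for _ in range(d)])
w_random = normalize([rng.gauss(0.0, 1.0) for _ in range(d)])
# Hypothetical stand-in for a pre-trained start: the signal plus isotropic noise.
w_warm = normalize([t + 0.1 * rng.gauss(0.0, 1.0) for t in theta])

for name, w0 in [("random", w_random), ("warm", w_warm)]:
    w = online_sgd(w0, theta, n_samples=2000, lr=0.01, rng=rng)
    print(name, "overlap with theta:", round(dot(w, theta), 3))
```

The printed overlap `dot(w, theta)` is the quantity whose growth determines how many samples are needed; the sketch only shows the mechanics and makes no claim about the rates proved in the paper.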

arXiv information

Authors	Taj Jones-McCormick, Aukosh Jagannath, Subhabrata Sen
Published	2025-06-11 17:36:14+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google

Categories: cs.LG, stat.ML

Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation

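Posted: June 12, 2025 | Author: jarxiv

Summary

Autoregressive large language models (AR-LLMs) frequently exhibit implicit parallelism in sequential generation.
Inspired by this, we introduce Multiverse, a new generative model that enables natively parallel generation.
Multiverse internalizes the MapReduce paradigm and generates automatically through three stages: (i) a Map stage for adaptive task decomposition, (ii) a Process stage for parallel subtask execution, and (iii) a Reduce stage for lossless result synthesis.
Next, we build a real-world Multiverse reasoning model with co-design of data, algorithm, and system, enabling rapid and seamless transfer from frontier AR-LLMs.
Starting from sequential reasoning chains, we create Multiverse 1K by converting them into structured training data with an automated LLM-assisted pipeline, avoiding costly human annotation.
Algorithmically, we design Multiverse Attention to separate parallel reasoning steps while staying compatible with causal attention for efficient training.
On the systems side, we implement Multiverse Engine to enable parallel inference.
It features a dedicated scheduler that dynamically switches between sequential and parallel generation, triggered directly by the model.
After 3 hours of fine-tuning on 1K examples, our Multiverse-32B stands as the only open-source non-AR model achieving performance on par with leading AR-LLMs of the same scale, as evidenced by AIME24 and AIME25 scores of 54% and 46%, respectively.
Moreover, our budget-control experiments show that Multiverse-32B scales better, outperforming AR-LLMs by 1.87% on average at the same context length.
This scaling also yields practical efficiency gains, with up to 2x speedup across varying batch sizes.
We have open-sourced the entire Multiverse ecosystem, including data, model weights, engine, supporting tools, complete data curation prompts, and detailed training and evaluation recipes.

Summary (original)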

Autoregressive Large Language Models (AR-LLMs) frequently exhibit implicit parallelism in sequential generation. Inspired by this, we introduce Multiverse, a new generative model that enables natively parallel generation. Multiverse internalizes a MapReduce paradigm, generating automatically through three stages: (i) a Map stage for adaptive task decomposition, (ii) a Process stage for parallel subtask execution, and (iii) a Reduce stage for lossless result synthesis. Next, we build a real-world Multiverse reasoning model with co-design of data, algorithm, and system, enabling rapid and seamless transfer from frontier AR-LLMs. Starting from sequential reasoning chains, we create Multiverse 1K by converting them into structured training data using an automated LLM-assisted pipeline, avoiding costly human annotations. Algorithmically, we design Multiverse Attention to separate parallel reasoning steps while keeping compatibility with causal attention for efficient training. Systematically, we implement Multiverse Engine to enable parallel inference. It features a dedicated scheduler that dynamically switches between sequential and parallel generation, triggered directly by the model. After a 3-hour fine-tuning with 1K examples, our Multiverse-32B stands as the only open-sourced non-AR model achieving performance on par with leading AR-LLMs of the same scale, evidenced by AIME24 & 25 scores of 54% and 46%, respectively. Moreover, our budget control experiments show that Multiverse-32B exhibits superior scaling, outperforming AR-LLMs by 1.87% on average using the same context length. Such scaling further leads to practical efficiency gain, achieving up to 2x speedup across varying batch sizes. We have open-sourced the entire Multiverse ecosystem, including data, model weights, engine, supporting tools, as well as complete data curation prompts and detailed training and evaluation recipes.
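
The three stages named in the abstract follow the general MapReduce pattern, which can be sketched in plain Python. This is only an analogy for the control flow (decompose, run in parallel, merge); it says nothing about how Multiverse Attention or Multiverse Engine work, and the function names and the toy sum-of-squares task are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def map_stage(numbers, chunk_size):
    """Map: decompose the task into independent subtasks (fixed-size chunks here)."""
    return [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]

def process_stage(chunks):
    """Process: run subtasks in parallel; pool.map returns results in input order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda chunk: sum(x * x for x in chunk), chunks))

def reduce_stage(partials):
    """Reduce: merge the partial results into the final answer."""
    return sum(partials)

def sum_of_squares(numbers, chunk_size=2):
    return reduce_stage(process_stage(map_stage(numbers, chunk_size)))

print(sum_of_squares([1, 2, 3, 4, 5]))
```

In this toy version the decomposition is fixed in advance, whereas the abstract describes the model itself deciding when to split and merge.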

arXiv information

Authors	Xinyu Yang, Yuwei An, Hongyi Liu, Tianqi Chen, Beidi Chen
Published	2025-06-11 17:59:23+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google

Categories: cs.LG

EVINET: Towards Open-World Graph Learning via Evidential Reasoning Network

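Posted: June 12, 2025 | Author: jarxiv

Summary

Graph learning has been crucial to many real-world tasks, but it is often studied under a closed-world assumption, in which all possible labels of the data are known a priori.
To enable effective graph learning in open and noisy environments, it is critical to inform model users when the model makes a wrong prediction on in-distribution data of a known class (misclassification detection) or when it encounters out-of-distribution data from novel classes (out-of-distribution detection).
This paper introduces Evidential Reasoning Network (EVINET), a framework that addresses these two challenges by integrating Beta embedding within a subjective logic framework.
EVINET includes two key modules: Dissonance Reasoning for misclassification detection and Vacuity Reasoning for out-of-distribution detection.
Extensive experiments show that EVINET outperforms state-of-the-art methods across multiple metrics on in-distribution classification, misclassification detection, and out-of-distribution detection.
EVINET demonstrates the necessity of uncertainty estimation and logical reasoning for misclassification detection and out-of-distribution detection, and paves the way for open-world graph learning.
Our code and data are available at https://github.com/SSSKJ/EviNET.

Summary (original)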

Graph learning has been crucial to many real-world tasks, but they are often studied with a closed-world assumption, with all possible labels of data known a priori. To enable effective graph learning in an open and noisy environment, it is critical to inform the model users when the model makes a wrong prediction to in-distribution data of a known class, i.e., misclassification detection or when the model encounters out-of-distribution from novel classes, i.e., out-of-distribution detection. This paper introduces Evidential Reasoning Network (EVINET), a framework that addresses these two challenges by integrating Beta embedding within a subjective logic framework. EVINET includes two key modules: Dissonance Reasoning for misclassification detection and Vacuity Reasoning for out-of-distribution detection. Extensive experiments demonstrate that EVINET outperforms state-of-the-art methods across multiple metrics in the tasks of in-distribution classification, misclassification detection, and out-of-distribution detection. EVINET demonstrates the necessity of uncertainty estimation and logical reasoning for misclassification detection and out-of-distribution detection and paves the way for open-world graph learning. Our code and data are available at https://github.com/SSSKJ/EviNET.
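
Vacuity and dissonance are standard uncertainty measures in subjective logic, and a small sketch may help show why one suits out-of-distribution detection and the other misclassification detection. The code below uses the common Dirichlet-evidence definitions; it is an assumption that these match what EVINET computes, since the abstract describes a Beta-embedding formulation that is not reproduced here. The function names are hypothetical.

```python
def opinion(evidence):
    """Return (beliefs, vacuity) for a list of non-negative per-class evidence."""
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes  # Dirichlet strength S = sum(e_k + 1)
    beliefs = [e / strength for e in evidence]
    vacuity = num_classes / strength        # close to 1 when little evidence exists
    return beliefs, vacuity

def balance(b_j, b_k):
    """Relative balance of two belief masses; 0 if either mass is zero."""
    if b_j == 0 or b_k == 0:
        return 0.0
    return 1.0 - abs(b_j - b_k) / (b_j + b_k)

def dissonance(beliefs):
    """High when strong evidence is split between competing classes."""
    total = 0.0
    for k, b_k in enumerate(beliefs):
        others = [b for j, b in enumerate(beliefs) if j != k]
        denom = sum(others)
        if b_k == 0 or denom == 0:
            continue
        total += b_k * sum(b_j * balance(b_j, b_k) for b_j in others) / denom
    return total

for evidence in ([0, 0, 0], [10, 10, 0], [20, 0, 0]):
    beliefs, vacuity = opinion(evidence)
    print(evidence, "vacuity:", round(vacuity, 3), "dissonance:", round(dissonance(beliefs), 3))
```

With no evidence at all the vacuity is maximal, which is the signature of an unfamiliar input; with plenty of evidence split evenly between two classes the vacuity is low but the dissonance is high, which is the signature of a likely misclassification.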

arXiv information

Authors	Weijie Guan, Haohui Wang, Jian Kang, Lihui Liu, Dawei Zhou
Published	2025-06-11 17:59:50+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google

Categories: cs.LG

Reasoning Language Models: A Blueprint

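Posted: June 12, 2025 | Author: jarxiv

Summary

Reasoning language models (RLMs), also known as Large Reasoning Models (LRMs), such as OpenAI's o1 and o3, DeepSeek-R1, and Alibaba's QwQ, have redefined AI's problem-solving capabilities by extending LLMs with advanced reasoning mechanisms.
Yet their high costs, proprietary nature, and complex architectures, which uniquely combine reinforcement learning (RL), search heuristics, and LLMs, present accessibility and scalability challenges.
To address these, we propose a comprehensive blueprint that organizes RLM components into a modular framework, based on a survey and analysis of all RLM works.
The blueprint incorporates diverse reasoning structures (chains, trees, graphs, and nested forms), reasoning strategies (e.g., Monte Carlo Tree Search, Beam Search), RL concepts (policy, value models, and others), supervision schemes (outcome-based and process-based supervision), and other related concepts (e.g., test-time compute, retrieval-augmented generation, agent tools).
We also provide detailed mathematical formulations and algorithmic specifications to simplify RLM implementation.
By showing how schemes such as LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as special cases, we demonstrate the blueprint's versatility and unifying potential.
To illustrate its utility, we introduce x1, a modular implementation for rapid RLM prototyping and experimentation.
Using x1 and a literature review, we provide key insights, such as multi-phase training for policy and value models and the importance of familiar training distributions.
Finally, we discuss scalable RLM cloud deployments and outline how RLMs can integrate with the broader LLM ecosystem.
Our work demystifies RLM construction, democratizes advanced reasoning capabilities, and fosters innovation, aiming to narrow the gap between "rich AI" and "poor AI" by lowering barriers to RLM design and experimentation.

Summary (original)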

Reasoning language models (RLMs), also known as Large Reasoning Models (LRMs), such as OpenAI’s o1 and o3, DeepSeek-R1, and Alibaba’s QwQ, have redefined AI’s problem-solving capabilities by extending LLMs with advanced reasoning mechanisms. Yet, their high costs, proprietary nature, and complex architectures – uniquely combining reinforcement learning (RL), search heuristics, and LLMs – present accessibility and scalability challenges. To address these, we propose a comprehensive blueprint that organizes RLM components into a modular framework, based on a survey and analysis of all RLM works. This blueprint incorporates diverse reasoning structures (chains, trees, graphs, and nested forms), reasoning strategies (e.g., Monte Carlo Tree Search, Beam Search), RL concepts (policy, value models and others), supervision schemes (Outcome-Based and Process-Based Supervision), and other related concepts (e.g., Test-Time Compute, Retrieval-Augmented Generation, agent tools). We also provide detailed mathematical formulations and algorithmic specifications to simplify RLM implementation. By showing how schemes like LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as special cases, we demonstrate the blueprint’s versatility and unifying potential. To illustrate its utility, we introduce x1, a modular implementation for rapid RLM prototyping and experimentation. Using x1 and a literature review, we provide key insights, such as multi-phase training for policy and value models, and the importance of familiar training distributions. Finally, we discuss scalable RLM cloud deployments and we outline how RLMs can integrate with a broader LLM ecosystem. Our work demystifies RLM construction, democratizes advanced reasoning capabilities, and fosters innovation, aiming to mitigate the gap between ‘rich AI’ and ‘poor AI’ by lowering barriers to RLM design and experimentation.
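
Beam search is one of the reasoning strategies the abstract lists. As a rough sketch of the idea, the generic routine below keeps the `beam_width` best partial states at each depth. It is not taken from the blueprint or from x1; the `expand` and `score` callables stand in for whatever a real system would use (for example a policy model proposing next steps and a value model scoring them), and the string-building toy problem is purely illustrative.

```python
def beam_search(start, expand, score, beam_width, depth):
    """Keep the top `beam_width` states by `score` after each expansion step."""
    beam = [start]
    for _ in range(depth):
        candidates = [child for state in beam for child in expand(state)]
        if not candidates:
            break
        # sorted() is stable, so ties keep their generation order.
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
    return beam

# Toy problem: build a 3-character string that contains as many "a"s as possible.
def expand(state):
    return [state + "a", state + "b"]

def score(state):
    return state.count("a")

print(beam_search("", expand, score, beam_width=2, depth=3))
```

A chain-structured reasoner corresponds to `beam_width=1`, while larger widths explore a pruned tree of reasoning steps.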

arXiv information

Authors	Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Zixuan Chen, Hubert Niewiadomski, Torsten Hoefler
Published	2025-06-11 13:19:22+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google

Categories: cs.AI, cs.CL

LLM2TEA: Agentic AI Designer Finds Innovative Objects with Generative Evolutionary Multitasking

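Posted: June 12, 2025 | Author: jarxiv

Summary

In this paper, we introduce the LLM-driven MultiTask Evolutionary Algorithm (LLM2TEA), the first agentic AI designer within a generative evolutionary multitasking (GEM) framework that promotes the crossover and synergy of designs from multiple domains, leading to innovative solutions that transcend individual disciplines.
Of particular interest is the discovery of objects that are not only innovative but also conform to real-world physical specifications in science and engineering.
LLM2TEA comprises a large language model to initialize a population of genotypes (defined by text prompts) describing the objects of interest, a text-to-3D generative model to produce phenotypes from these prompts, a classifier to interpret the semantic representations of the objects, and a physics simulation model to assess their physical properties.
We propose several novel LLM-based multitask evolutionary operators to guide the search toward high-performing practical objects.
Experimental results in conceptual design optimization validate the effectiveness of LLM2TEA, showing a 97% to 174% improvement in the diversity of innovative objects over the current text-to-3D generative model baseline.
In addition, more than 73% of the generated designs have better physical performance than the top 1% of designs generated by the baseline.
Moreover, LLM2TEA generates designs that are not only aesthetically creative but also functional in real-world applications.
Several of these designs have been successfully 3D-printed, underscoring the approach's capacity to turn AI-generated outputs into tangible physical objects.
The designs produced by LLM2TEA meet practical requirements while showcasing creative and innovative features, underscoring its potential applications in complex design optimization and discovery.

Summary (original)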

In this paper, we introduce LLM-driven MultiTask Evolutionary Algorithm (LLM2TEA), the first agentic AI designer within a generative evolutionary multitasking (GEM) framework that promotes the crossover and synergy of designs from multiple domains, leading to innovative solutions that transcend individual disciplines. Of particular interest is the discovery of objects that are not only innovative but also conform to the physical specifications of the real world in science and engineering. LLM2TEA comprises a large language model to initialize a population of genotypes (defined by text prompts) describing the objects of interest, a text-to-3D generative model to produce phenotypes from these prompts, a classifier to interpret the semantic representations of the objects, and a physics simulation model to assess their physical properties. We propose several novel LLM-based multitask evolutionary operators to guide the search toward the discovery of high-performing practical objects. Experimental results in conceptual design optimization validate the effectiveness of LLM2TEA, revealing from 97\% to 174\% improvement in the diversity of innovative objects compared to the present text-to-3D generative model baseline. In addition, more than 73\% of the generated designs have better physical performance than the top 1\% percentile of the designs generated in the baseline. Moreover, LLM2TEA generates designs that are not only aesthetically creative but also functional in real-world applications. Several of these designs have been successfully 3D-printed, emphasizing the proposed approach’s capacity to transform AI-generated outputs into tangible physical objects. The designs produced by LLM2TEA meets practical requirements while showcasing creative and innovative features, underscoring its potential applications in complex design optimization and discovery.

arXiv information

Authors	Melvin Wong, Jiao Liu, Thiago Rios, Stefan Menzel, Yew Soon Ong
Published	2025-06-11 13:19:51+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google

Categories: cs.AI, cs.CL, cs.CV, cs.LG, cs.NE

Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements

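Posted: June 12, 2025 | Author: jarxiv

Summary

Prolonged Exposure (PE) therapy is an effective treatment for post-traumatic stress disorder (PTSD), but evaluating therapist fidelity remains labor-intensive because session recordings must be reviewed manually.
We present a method for the automatic temporal localization of key PE fidelity elements, identifying their start and stop times directly from session audio and transcripts.
Our approach fine-tunes a large pre-trained audio-language model, Qwen2-Audio, using Low-Rank Adaptation (LoRA) to process focused 30-second windows of audio-transcript input.
Fidelity labels for three core protocol phases, therapist orientation (P1), imaginal exposure (P2), and post-imaginal processing (P3), are generated via LLM-based prompting and verified by trained raters.
The model is trained to predict normalized boundary offsets using soft supervision guided by task-specific prompts.
On a dataset of 313 real PE sessions, the best configuration (LoRA rank 8, 30 s windows) achieves a mean absolute error (MAE) of 5.3 seconds across tasks.
We further analyze the effects of window size and LoRA rank, highlighting the importance of context granularity and model adaptation.
This work introduces a scalable framework for fidelity tracking in PE therapy, with the potential to support clinician training, supervision, and quality assurance.

Summary (original)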

Prolonged Exposure (PE) therapy is an effective treatment for post-traumatic stress disorder (PTSD), but evaluating therapist fidelity remains labor-intensive due to the need for manual review of session recordings. We present a method for the automatic temporal localization of key PE fidelity elements — identifying their start and stop times — directly from session audio and transcripts. Our approach fine-tunes a large pre-trained audio-language model, Qwen2-Audio, using Low-Rank Adaptation (LoRA) to process focused 30-second windows of audio-transcript input. Fidelity labels for three core protocol phases — therapist orientation (P1), imaginal exposure (P2), and post-imaginal processing (P3) — are generated via LLM-based prompting and verified by trained raters. The model is trained to predict normalized boundary offsets using soft supervision guided by task-specific prompts. On a dataset of 313 real PE sessions, our best configuration (LoRA rank 8, 30s windows) achieves a mean absolute error (MAE) of 5.3 seconds across tasks. We further analyze the effects of window size and LoRA rank, highlighting the importance of context granularity and model adaptation. This work introduces a scalable framework for fidelity tracking in PE therapy, with potential to support clinician training, supervision, and quality assurance.
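
To make the reported metric concrete, the sketch below converts normalized boundary offsets back to seconds and computes the mean absolute error. It assumes that an offset is normalized by the window length, so that 0.0 is the window start and 1.0 is the window end; the abstract does not spell out the normalization, and the numbers used are made up for illustration.

```python
WINDOW_SECONDS = 30.0  # window length stated in the abstract

def to_seconds(offset, window_seconds=WINDOW_SECONDS):
    """Map a normalized offset in [0, 1] to seconds from the window start."""
    return offset * window_seconds

def mean_absolute_error(predicted, reference, window_seconds=WINDOW_SECONDS):
    """MAE in seconds between predicted and reference normalized offsets."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [
        abs(to_seconds(p, window_seconds) - to_seconds(r, window_seconds))
        for p, r in zip(predicted, reference)
    ]
    return sum(errors) / len(errors)

# Illustrative values only, not data from the paper.
print(mean_absolute_error([0.5, 0.2], [0.4, 0.4]))
```

Under this assumption an MAE of 5.3 seconds corresponds to an average error of roughly 18% of a 30-second window.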

arXiv information

Authors	Suhas BN, Andrew M. Sherrill, Jyoti Alaparthi, Dominik Mattioli, Rosa I. Arriaga, Chris W. Wiese, Saeed Abdullah
Published	2025-06-11 13:21:06+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google

Categories: 68T07, cs.CL, cs.HC, eess.AS, H.5.2

Guidelines for Fine-grained Sentence-level Arabic Readability Annotation

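Posted: June 12, 2025 | Author: jarxiv

Summary

This paper presents the annotation guidelines of the Balanced Arabic Readability Evaluation Corpus (BAREC), a large-scale resource for fine-grained sentence-level readability assessment in Arabic.
BAREC includes 69,441 sentences (1M+ words) labeled across 19 levels, from kindergarten to postgraduate.
Based on the Taha/Arabi21 framework, the guidelines were refined through iterative training with native Arabic-speaking educators.
We highlight key linguistic, pedagogical, and cognitive factors in determining readability and report high inter-annotator agreement, with a Quadratic Weighted Kappa of 81.8% in the last annotation phase.
We also benchmark automatic readability models across multiple classification granularities (19, 7, 5, and 3 levels).
The corpus and guidelines are publicly available.

Summary (original)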

This paper presents the annotation guidelines of the Balanced Arabic Readability Evaluation Corpus (BAREC), a large-scale resource for fine-grained sentence-level readability assessment in Arabic. BAREC includes 69,441 sentences (1M+ words) labeled across 19 levels, from kindergarten to postgraduate. Based on the Taha/Arabi21 framework, the guidelines were refined through iterative training with native Arabic-speaking educators. We highlight key linguistic, pedagogical, and cognitive factors in determining readability and report high inter-annotator agreement: Quadratic Weighted Kappa 81.8% (substantial/excellent agreement) in the last annotation phase. We also benchmark automatic readability models across multiple classification granularities (19-, 7-, 5-, and 3-level). The corpus and guidelines are publicly available.
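
Quadratic Weighted Kappa, the agreement statistic quoted in the abstract, penalizes disagreements by the squared distance between the two assigned levels, which suits ordinal scales such as readability levels. The following is a minimal pure-Python sketch of the standard definition; the function name and the toy label lists are illustrative and are not BAREC data.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_levels):
    """Cohen's kappa with quadratic weights for integer labels in [0, num_levels)."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b) or num_levels < 2:
        raise ValueError("need equal-length, non-empty ratings and at least 2 levels")
    observed = [[0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_levels)) for j in range(num_levels)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two raters were independent.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    if weighted_expected == 0:
        return 1.0  # both raters used a single identical level throughout
    return 1.0 - weighted_observed / weighted_expected

print(quadratic_weighted_kappa([0, 1, 2, 2], [0, 1, 2, 1], num_levels=3))
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.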

arXiv information

Authors	Nizar Habash, Hanada Taha-Thomure, Khalid N. Elmadani, Zeina Zeino, Abdallah Abushmaes
Published	2025-06-11 13:30:18+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google

Categories: cs.CL

MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Cultural Learning

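Posted: June 12, 2025 | Author: jarxiv

Summary

Embodied agents powered by large language models (LLMs), such as Voyager, promise open-ended competence in worlds such as Minecraft.
However, when powered by open-weight LLMs, they still falter on elementary tasks even after domain-specific fine-tuning.
We propose MindForge, a generative-agent framework for cultural lifelong learning through explicit perspective taking.
We introduce three key innovations: (1) a structured theory-of-mind representation linking percepts, beliefs, desires, and actions;
(2) natural inter-agent communication;
and (3) a multi-component memory system.
Following the cultural learning framework, we test MindForge in both instructive and collaborative settings within Minecraft.
In an instructive setting with GPT-4, MindForge agents powered by open-weight LLMs significantly outperform their Voyager counterparts on basic tasks, reaching 3× more tech-tree milestones and collecting 2.3× more unique items than the Voyager baseline.
Furthermore, in fully collaborative settings, the performance of two underachieving agents improves with more communication rounds, echoing the Condorcet Jury Theorem.
MindForge agents demonstrate sophisticated behaviors, including expert-novice knowledge transfer, collaborative problem solving, and adaptation to out-of-distribution tasks through accumulated cultural experience.

Summary (original)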

Embodied agents powered by large language models (LLMs), such as Voyager, promise open-ended competence in worlds such as Minecraft. However, when powered by open-weight LLMs they still falter on elementary tasks after domain-specific fine-tuning. We propose MindForge, a generative-agent framework for cultural lifelong learning through explicit perspective taking. We introduce three key innovations: (1) a structured theory of mind representation linking percepts, beliefs, desires, and actions; (2) natural inter-agent communication; and (3) a multi-component memory system. Following the cultural learning framework, we test MindForge in both instructive and collaborative settings within Minecraft. In an instructive setting with GPT-4, MindForge agents powered by open-weight LLMs significantly outperform their Voyager counterparts in basic tasks yielding $3\times$ more tech-tree milestones and collecting $2.3\times$ more unique items than the Voyager baseline. Furthermore, in fully \textit{collaborative} settings, we find that the performance of two underachieving agents improves with more communication rounds, echoing the Condorcet Jury Theorem. MindForge agents demonstrate sophisticated behaviors, including expert-novice knowledge transfer, collaborative problem solving, and adaptation to out-of-distribution tasks through accumulated cultural experiences.
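
The Condorcet Jury Theorem mentioned in the abstract states that when independent voters are each correct with probability above one half, the probability that a majority vote is correct grows with the number of voters. A short worked computation shows the effect. The voter counts and the accuracy of 0.6 are illustrative assumptions and do not describe the MindForge agents, whose communication rounds are only loosely analogous to independent votes.

```python
from math import comb

def majority_correct_probability(n_voters, p_correct):
    """Probability that a strict majority of independent voters is correct."""
    if n_voters < 1 or n_voters % 2 == 0:
        raise ValueError("use a positive odd number of voters to avoid ties")
    majority = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct ** k * (1 - p_correct) ** (n_voters - k)
        for k in range(majority, n_voters + 1)
    )

for n in (1, 3, 5, 11):
    print(n, round(majority_correct_probability(n, 0.6), 4))
```

For three voters at accuracy 0.6 the majority is correct with probability 3(0.6)^2(0.4) + (0.6)^3 = 0.648, already above the individual accuracy.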

arXiv information

Authors	Mircea Lică, Ojas Shirekar, Baptiste Colle, Chirag Raman
Published	2025-06-11 14:09:04+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google

Categories: cs.AI, cs.CL

ComfyUI-R1: Exploring Reasoning Models for Workflow Generation

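Posted: June 12, 2025 | Author: jarxiv

Summary

AI-generated content has evolved from monolithic models to modular workflows, particularly on platforms like ComfyUI, enabling customization in creative pipelines.
However, crafting effective workflows requires considerable expertise to orchestrate numerous specialized components, presenting users with a steep learning curve.
To address this challenge, we introduce ComfyUI-R1, the first large reasoning model for automated workflow generation.
Starting from a curated dataset of 4K workflows, we construct long chain-of-thought (CoT) reasoning data covering node selection, workflow planning, and code-level workflow representation.
ComfyUI-R1 is trained through a two-stage framework: (1) CoT fine-tuning for cold start, adapting the model to the ComfyUI domain;
(2) reinforcement learning to incentivize reasoning capability, guided by a fine-grained rule-metric hybrid reward that ensures format validity, structural integrity, and node-level fidelity.
Experiments show that our 7B-parameter model achieves a 97% format validity rate along with high pass rate and node-level and graph-level F1 scores, significantly surpassing prior state-of-the-art methods that employ leading closed-source models such as GPT-4o and the Claude series.
Further analysis highlights the critical role of the reasoning process and the advantage of transforming workflows into code.
Qualitative comparison shows its strength in synthesizing intricate workflows with diverse nodes, underscoring the potential of long CoT reasoning in AI art creation.

Summary (original)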

AI-generated content has evolved from monolithic models to modular workflows, particularly on platforms like ComfyUI, enabling customization in creative pipelines. However, crafting effective workflows requires great expertise to orchestrate numerous specialized components, presenting a steep learning curve for users. To address this challenge, we introduce ComfyUI-R1, the first large reasoning model for automated workflow generation. Starting with our curated dataset of 4K workflows, we construct long chain-of-thought (CoT) reasoning data, including node selection, workflow planning, and code-level workflow representation. ComfyUI-R1 is trained through a two-stage framework: (1) CoT fine-tuning for cold start, adapting models to the ComfyUI domain; (2) reinforcement learning for incentivizing reasoning capability, guided by a fine-grained rule-metric hybrid reward, ensuring format validity, structural integrity, and node-level fidelity. Experiments show that our 7B-parameter model achieves a 97\% format validity rate, along with high pass rate, node-level and graph-level F1 scores, significantly surpassing prior state-of-the-art methods that employ leading closed-source models such as GPT-4o and Claude series. Further analysis highlights the critical role of the reasoning process and the advantage of transforming workflows into code. Qualitative comparison reveals our strength in synthesizing intricate workflows with diverse nodes, underscoring the potential of long CoT reasoning in AI art creation.
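
A node-level F1 score of the kind mentioned in the abstract can be sketched as set overlap between the node types in a generated workflow and those in a reference workflow. This is one plausible reading and is an assumption: the paper's exact metric (for example how duplicate nodes or parameters are handled) is not given in the abstract, and the node names below are placeholder strings.

```python
def node_f1(predicted_nodes, reference_nodes):
    """F1 between the sets of node types in a predicted and a reference workflow."""
    predicted = set(predicted_nodes)
    reference = set(reference_nodes)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)

# Placeholder node names, used only to exercise the function.
predicted = ["LoadModel", "EncodePrompt", "Sample"]
reference = ["EncodePrompt", "Sample", "SaveImage"]
print(node_f1(predicted, reference))
```

A graph-level variant would compare edges (pairs of connected nodes) in the same way rather than individual nodes.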

arXiv information

Authors	Zhenran Xu, Yiyu Wang, Xue Yang, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang
Published	2025-06-11 14:35:15+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google

Categories: cs.CL, cs.CV, cs.SE

Do LLMs Give Psychometrically Plausible Responses in Educational Assessments?

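Posted: June 12, 2025 | Author: jarxiv

Summary

Knowing how test takers answer items in educational assessments is essential for test development, for evaluating item quality, and for improving test validity.
However, this process usually requires extensive pilot studies with human participants.
If large language models (LLMs) exhibit human-like response behavior on test items, they could be used as pilot participants to accelerate test development.
In this paper, we evaluate the human-likeness, or psychometric plausibility, of responses from 18 instruction-tuned LLMs using two publicly available datasets of multiple-choice test items across three subjects: reading, U.S. history, and economics.
Our methodology builds on two theoretical frameworks from psychometrics that are commonly used in educational assessment: classical test theory and item response theory.
The results show that while larger models are excessively confident, their response distributions can become more human-like when calibrated with temperature scaling.
In addition, LLMs tend to correlate better with humans on reading comprehension items than on other subjects.
However, the correlations are not very strong overall, indicating that LLMs should not be used for piloting educational assessments in a zero-shot setting.

Summary (original)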

Knowing how test takers answer items in educational assessments is essential for test development, to evaluate item quality, and to improve test validity. However, this process usually requires extensive pilot studies with human participants. If large language models (LLMs) exhibit human-like response behavior to test items, this could open up the possibility of using them as pilot participants to accelerate test development. In this paper, we evaluate the human-likeness or psychometric plausibility of responses from 18 instruction-tuned LLMs with two publicly available datasets of multiple-choice test items across three subjects: reading, U.S. history, and economics. Our methodology builds on two theoretical frameworks from psychometrics which are commonly used in educational assessment, classical test theory and item response theory. The results show that while larger models are excessively confident, their response distributions can be more human-like when calibrated with temperature scaling. In addition, we find that LLMs tend to correlate better with humans in reading comprehension items compared to other subjects. However, the correlations are not very strong overall, indicating that LLMs should not be used for piloting educational assessments in a zero-shot setting.
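
Temperature scaling, the calibration step named in the abstract, divides the logits by a temperature before the softmax, so a temperature above 1 flattens an overconfident distribution over answer options. The sketch below shows the mechanism only; the logits are invented, and how the authors choose the temperature is not stated in the abstract.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature; higher temperature gives a flatter result."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Invented logits for a four-option multiple-choice item.
logits = [4.0, 1.0, 0.0, 0.0]
for temperature in (1.0, 2.0, 4.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

The ranking of the options never changes under temperature scaling; only the spread of probability mass does, which is why it can make response distributions look more like the mix of answers seen across human test takers.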

arXiv information

Authors	Andreas Säuberli, Diego Frassinelli, Barbara Plank
Published	2025-06-11 14:41:10+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google

Categories: cs.CL
