jarxiv | Japanese arxiv | Page 598

GASCADE: Grouped Summarization of Adverse Drug Event for Enhanced Cancer Pharmacovigilance

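Posted: May 8, 2025 | Author: jarxiv

Summary

In cancer treatment, summarizing the adverse drug events (ADEs) reported by patients taking prescribed drugs is crucial for strengthening pharmacovigilance practice and improving drug-related decision-making.
Although the volume and complexity of pharmacovigilance data have grown, existing research in this field has focused mainly on general diseases rather than on cancer specifically.
This work introduces the task of grouped summarization of adverse drug events reported by multiple patients using the same drug for cancer treatment.
To address the scarcity of resources in cancer pharmacovigilance, we present the MultiLabeled Cancer Adverse Drug Reaction and Summarization (MCADRS) dataset.
The dataset contains pharmacovigilance posts detailing patient concerns about drug efficacy and adverse effects, extracted labels for drug names, adverse drug events, severity, and adversity of reactions, and a summary of ADEs for each drug.
We further propose the Grouping and Abstractive Summarization of Cancer Adverse Drug events (GASCADE) framework, a novel pipeline that combines the information-extraction capabilities of large language models (LLMs) with the summarization ability of the encoder-decoder T5 model.
Our work is the first to apply alignment techniques, including advanced algorithms such as Direct Preference Optimization, to encoder-decoder models using synthetic datasets for summarization tasks.
Through extensive experiments, we demonstrate GASCADE's superior performance across a range of metrics, validated by both automated and human evaluation.
This multitask approach supports drug-related decision-making, deepens understanding of patient concerns, and paves the way for advances in personalized and responsive cancer care.
The code and dataset used in this work are publicly available.

Abstract (original)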

In the realm of cancer treatment, summarizing adverse drug events (ADEs) reported by patients using prescribed drugs is crucial for enhancing pharmacovigilance practices and improving drug-related decision-making. While the volume and complexity of pharmacovigilance data have increased, existing research in this field has predominantly focused on general diseases rather than specifically addressing cancer. This work introduces the task of grouped summarization of adverse drug events reported by multiple patients using the same drug for cancer treatment. To address the challenge of limited resources in cancer pharmacovigilance, we present the MultiLabeled Cancer Adverse Drug Reaction and Summarization (MCADRS) dataset. This dataset includes pharmacovigilance posts detailing patient concerns regarding drug efficacy and adverse effects, along with extracted labels for drug names, adverse drug events, severity, and adversity of reactions, as well as summaries of ADEs for each drug. Additionally, we propose the Grouping and Abstractive Summarization of Cancer Adverse Drug events (GASCADE) framework, a novel pipeline that combines the information extraction capabilities of Large Language Models (LLMs) with the summarization power of the encoder-decoder T5 model. Our work is the first to apply alignment techniques, including advanced algorithms like Direct Preference Optimization, to encoder-decoder models using synthetic datasets for summarization tasks. Through extensive experiments, we demonstrate the superior performance of GASCADE across various metrics, validated through both automated assessments and human evaluations. This multitasking approach enhances drug-related decision-making and fosters a deeper understanding of patient concerns, paving the way for advancements in personalized and responsive cancer care. The code and dataset used in this work are publicly available.
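The grouping step that precedes summarization can be pictured with a small sketch. This is only an illustration of "grouped" input construction, assuming records with hypothetical fields `drug`, `ade`, and `severity`; the placeholder drug names and the record layout are not the MCADRS schema, and the real pipeline extracts these labels with an LLM before passing grouped text to T5.

```python
from collections import defaultdict

# Hypothetical extracted records; field names and values are illustrative only.
posts = [
    {"drug": "drug_a", "ade": "hot flashes", "severity": "moderate"},
    {"drug": "drug_a", "ade": "joint pain", "severity": "mild"},
    {"drug": "drug_b", "ade": "nausea", "severity": "mild"},
    {"drug": "drug_a", "ade": "hot flashes", "severity": "severe"},
]


def group_by_drug(records):
    """Collect the ADE mentions reported for each drug, preserving input order."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["drug"]].append((record["ade"], record["severity"]))
    return dict(grouped)


def build_summarizer_input(drug, events):
    """Flatten one drug's grouped events into a single text for a summarizer."""
    parts = [f"{ade} ({severity})" for ade, severity in events]
    return f"drug: {drug}; reported events: " + "; ".join(parts)


grouped = group_by_drug(posts)
inputs = {drug: build_summarizer_input(drug, events) for drug, events in grouped.items()}
for text in inputs.values():
    print(text)
```

Each string in `inputs` stands in for the per-drug text that an abstractive summarizer would receive.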

arXiv information

Authors	Sofia Jamil, Aryan Dabad, Bollampalli Areen Reddy, Sriparna Saha, Rajiv Misra, Adil A. Shakur
Published	2025-05-07 09:40:18+00:00

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.CL

Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers

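Posted: May 8, 2025 | Author: jarxiv

Summary

Transformers have achieved great success on many NLP tasks but still show notable gaps in multi-step factual reasoning, especially when real-world knowledge is sparse.
Recent advances in grokking have shown that neural networks can move from memorization to perfect generalization once they detect the underlying logical patterns, yet these studies have mostly used small synthetic tasks.
In this paper we extend grokking to real-world factual data for the first time and address dataset sparsity by augmenting existing knowledge graphs with carefully designed synthetic data, raising the ratio $\phi_r$ of inferred facts to atomic facts above the threshold required for grokking.
Surprisingly, we find that even factually incorrect synthetic data can strengthen emergent reasoning circuits rather than degrade accuracy, because it forces the model to rely on relational structure instead of memorization.
On multi-hop reasoning benchmarks, our approach achieves up to 95-100% accuracy on 2WikiMultiHopQA.
We also provide a detailed analysis of how increasing $\phi_r$ drives the formation of generalizing circuits inside Transformers.
Our findings suggest that grokking-based data augmentation can unlock implicit multi-hop reasoning capabilities, opening the door to more robust and interpretable factual reasoning in large language models.

Abstract (original)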

Transformers have achieved great success in numerous NLP tasks but continue to exhibit notable gaps in multi-step factual reasoning, especially when real-world knowledge is sparse. Recent advances in grokking have demonstrated that neural networks can transition from memorizing to perfectly generalizing once they detect underlying logical patterns – yet these studies have primarily used small, synthetic tasks. In this paper, for the first time, we extend grokking to real-world factual data and address the challenge of dataset sparsity by augmenting existing knowledge graphs with carefully designed synthetic data to raise the ratio $\phi_r$ of inferred facts to atomic facts above the threshold required for grokking. Surprisingly, we find that even factually incorrect synthetic data can strengthen emergent reasoning circuits rather than degrade accuracy, as it forces the model to rely on relational structure rather than memorization. When evaluated on multi-hop reasoning benchmarks, our approach achieves up to 95-100% accuracy on 2WikiMultiHopQA – substantially improving over strong baselines and matching or exceeding current state-of-the-art results. We further provide an in-depth analysis of how increasing $\phi_r$ drives the formation of generalizing circuits inside Transformers. Our findings suggest that grokking-based data augmentation can unlock implicit multi-hop reasoning capabilities, opening the door to more robust and interpretable factual reasoning in large-scale language models.
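To make the ratio $\phi_r$ concrete, the sketch below counts two-hop inferred facts over a toy knowledge graph and divides by the number of atomic facts. It is a simplified, global version of the ratio under stated assumptions: the triples, the function names, and the restriction to two hops are all illustrative, and the paper's exact definition may differ.

```python
from collections import defaultdict


def two_hop_facts(atomic):
    """Compose every pair of atomic triples that share a bridge entity."""
    by_head = defaultdict(list)
    for head, rel, tail in atomic:
        by_head[head].append((rel, tail))
    inferred = set()
    for head, rel1, bridge in atomic:
        for rel2, tail in by_head[bridge]:
            inferred.add((head, rel1, rel2, tail))
    return inferred


def phi(atomic):
    """Ratio of two-hop inferred facts to atomic facts (simplified)."""
    atomic = set(atomic)
    return len(two_hop_facts(atomic)) / len(atomic)


# Toy graph: (head, relation, tail) triples.
atomic = [
    ("alice", "born_in", "paris"),
    ("bob", "born_in", "paris"),
    ("paris", "capital_of", "france"),
]
# Synthetic augmentation: extra triples that reuse the same bridge entity.
augmented = atomic + [
    ("carol", "born_in", "paris"),
    ("paris", "located_in", "europe"),
]

print(phi(atomic))     # 2 inferred facts over 3 atomic facts
print(phi(augmented))  # 6 inferred facts over 5 atomic facts
```

Adding triples that attach to an existing bridge entity raises the number of composable facts faster than the number of atomic facts, which is how augmentation can push the ratio upward.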

arXiv information

Authors	Roman Abramov, Felix Steinbauer, Gjergji Kasneci
Published	2025-05-07 09:47:51+00:00

Categories: cs.AI, cs.CL, cs.LG, I.2.3

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

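Posted: May 8, 2025 | Author: jarxiv

Summary

DeepSeek-R1 has shown that long chain-of-thought (CoT) reasoning can emerge naturally from a simple reinforcement learning (RL) framework with rule-based rewards, where training may start directly from the base model, a paradigm referred to as zero RL training.
Most recent efforts to reproduce zero RL training have focused on the Qwen2.5 model series, which may not be representative, because we find that these base models already exhibit strong instruction-following and self-reflection abilities.
In this work, we investigate zero RL training across 10 diverse base models spanning different families and sizes, including LLama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-math-7B, and all Qwen2.5 models from 0.5B to 32B.
By leveraging several key design strategies, such as adjusting the format reward and controlling query difficulty, we achieve substantial improvements in both reasoning accuracy and response length in most settings.
However, by carefully monitoring the training dynamics, we observe that different base models exhibit distinct patterns during training.
For instance, increased response length does not always correlate with the emergence of certain cognitive behaviors such as verification (i.e., the "aha moment").
Notably, we observe the "aha moment" for the first time in small models outside the Qwen family.
We share the key designs that enable successful zero RL training, along with our findings and practices.
To facilitate further research, we open-source the code, models, and analysis tools.

Abstract (original)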

DeepSeek-R1 has shown that long chain-of-thought (CoT) reasoning can naturally emerge through a simple reinforcement learning (RL) framework with rule-based rewards, where the training may directly start from the base models, a paradigm referred to as zero RL training. Most recent efforts to reproduce zero RL training have primarily focused on the Qwen2.5 model series, which may not be representative as we find the base models already exhibit strong instruction-following and self-reflection abilities. In this work, we investigate zero RL training across 10 diverse base models, spanning different families and sizes including LLama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-math-7B, and all Qwen2.5 models from 0.5B to 32B. Leveraging several key design strategies, such as adjusting format reward and controlling query difficulty, we achieve substantial improvements in both reasoning accuracy and response length across most settings. However, by carefully monitoring the training dynamics, we observe that different base models exhibit distinct patterns during training. For instance, the increased response length does not always correlate with the emergence of certain cognitive behaviors such as verification (i.e., the ‘aha moment’). Notably, we observe the ‘aha moment’ for the first time in small models not from the Qwen family. We share the key designs that enable successful zero RL training, along with our findings and practices. To facilitate further research, we open-source the code, models, and analysis tools.
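A rule-based reward of the kind mentioned above can be as small as a regular expression plus a string comparison. The sketch below is an assumption-laden illustration, not the authors' reward: the `\boxed{...}` answer convention, the function name, and the penalty value are all hypothetical. The `format_penalty` parameter marks where a format reward could be adjusted.

```python
import re

# Matches a final answer written as \boxed{...} (an assumed output convention).
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def rule_based_reward(response, reference, format_penalty=-0.5):
    """Score a response with fixed rules only; no learned reward model."""
    matches = BOXED.findall(response)
    if not matches:
        # The response ignored the expected format.
        return format_penalty
    predicted = matches[-1].strip()
    return 1.0 if predicted == reference.strip() else 0.0


print(rule_based_reward("so the answer is \\boxed{42}", "42"))
print(rule_based_reward("I think it is \\boxed{41}", "42"))
print(rule_based_reward("the answer is 42", "42"))
```

Setting `format_penalty` to `0.0` removes the format term entirely, which is one way to probe how strongly a base model depends on it.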

arXiv information

Authors	Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, Junxian He
Published	2025-05-07 09:57:34+00:00

Categories: cs.AI, cs.CL, cs.LG

Can LLM be a Good Path Planner based on Prompt Engineering? Mitigating the Hallucination for Path Planning

Posted: May 8, 2025 | Author: jarxiv

Summary

Spatial reasoning in large language models (LLMs) is the foundation of embodied intelligence.
However, even in simple maze environments, LLMs still encounter challenges in long-term path-planning, primarily influenced by their spatial hallucination and context inconsistency hallucination by long-term reasoning.
To address this challenge, this study proposes an innovative model, Spatial-to-Relational Transformation and Curriculum Q-Learning (S2RCQL).
To address the spatial hallucination of LLMs, we propose the Spatial-to-Relational approach, which transforms spatial prompts into entity relations and paths representing entity relation chains.
This approach fully taps the potential of LLMs in terms of sequential thinking.
As a result, we design a path-planning algorithm based on Q-learning to mitigate the context inconsistency hallucination, which enhances the reasoning ability of LLMs.
Using the Q-value of state-action as auxiliary information for prompts, we correct the hallucinations of LLMs, thereby guiding LLMs to learn the optimal path.
Finally, we propose a reverse curriculum learning technique based on LLMs to further mitigate the context inconsistency hallucination.
By reducing task difficulty, LLMs can rapidly accumulate successful experiences and leverage them to tackle more complex tasks.
We performed comprehensive experiments based on Baidu's self-developed LLM, ERNIE-Bot 4.0.
The results showed that our S2RCQL achieved a 23%–40% improvement in both success and optimality rates compared with advanced prompt engineering.

Abstract (original)

Spatial reasoning in Large Language Models (LLMs) is the foundation for embodied intelligence. However, even in simple maze environments, LLMs still encounter challenges in long-term path-planning, primarily influenced by their spatial hallucination and context inconsistency hallucination by long-term reasoning. To address this challenge, this study proposes an innovative model, Spatial-to-Relational Transformation and Curriculum Q-Learning (S2RCQL). To address the spatial hallucination of LLMs, we propose the Spatial-to-Relational approach, which transforms spatial prompts into entity relations and paths representing entity relation chains. This approach fully taps the potential of LLMs in terms of sequential thinking. As a result, we design a path-planning algorithm based on Q-learning to mitigate the context inconsistency hallucination, which enhances the reasoning ability of LLMs. Using the Q-value of state-action as auxiliary information for prompts, we correct the hallucinations of LLMs, thereby guiding LLMs to learn the optimal path. Finally, we propose a reverse curriculum learning technique based on LLMs to further mitigate the context inconsistency hallucination. LLMs can rapidly accumulate successful experiences by reducing task difficulty and leveraging them to tackle more complex tasks. We performed comprehensive experiments based on Baidu’s self-developed LLM: ERNIE-Bot 4.0. The results showed that our S2RCQL achieved a 23%–40% improvement in both success and optimality rates compared with advanced prompt engineering.
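The idea of planning over entity relations with Q-values can be sketched on a tiny graph. This is a minimal illustration under assumptions, not S2RCQL itself: the maze is already given as a hypothetical adjacency map, the standard Q-learning update is applied in deterministic sweeps over all state-action pairs rather than sampled episodes, and no LLM is involved. The `hint` string only suggests how Q-values might be attached to a prompt.

```python
def q_learning(edges, goal, gamma=0.9, alpha=0.5, sweeps=50):
    """Tabular Q-learning update applied in sweeps over {node: [neighbour, ...]}."""
    q = {(s, a): 0.0 for s, nbrs in edges.items() for a in nbrs}
    for _ in range(sweeps):
        for (s, a) in q:
            # Moving to neighbour `a` pays 1 only when `a` is the goal.
            reward = 1.0 if a == goal else 0.0
            future = 0.0 if a == goal else max(q[(a, b)] for b in edges[a])
            q[(s, a)] += alpha * (reward + gamma * future - q[(s, a)])
    return q


def greedy_path(q, edges, start, goal, max_steps=20):
    """Follow the highest-valued action from each node until the goal."""
    path = [start]
    while path[-1] != goal and len(path) <= max_steps:
        s = path[-1]
        path.append(max(edges[s], key=lambda a: q[(s, a)]))
    return path


# Maze expressed as relations ("A is connected to B and C", ...).
edges = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "E"],
    "E": ["D"],
}

q = q_learning(edges, goal="E")
path = greedy_path(q, edges, "A", "E")
# Q-values for the current node, formatted as auxiliary prompt text.
hint = "; ".join(f"{a}: {q[('A', a)]:.2f}" for a in edges["A"])
print(path)
print(hint)
```

The dead end at `C` receives a lower value than `B`, so the greedy path avoids it.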

arXiv information

Authors	Hourui Deng, Hongjie Zhang, Jie Ou, Chaosheng Feng
Published	2025-05-07 10:00:50+00:00

Categories: cs.CL

Benchmarking LLMs’ Swarm intelligence

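Posted: May 8, 2025 | Author: jarxiv

Summary

Large language models (LLMs) show potential for complex reasoning, yet their capacity for emergent coordination in multi-agent systems (MAS) under strict constraints, such as the limited local perception and communication characteristic of natural swarms, remains largely unexplored, particularly with respect to the nuances of swarm intelligence.
Existing benchmarks often fail to fully capture the unique challenges of decentralized coordination that arise when agents operate with incomplete spatio-temporal information.
To bridge this gap, we introduce SwarmBench, a novel benchmark designed to systematically evaluate the swarm-intelligence capabilities of LLMs acting as decentralized agents.
SwarmBench features five foundational MAS coordination tasks within a configurable 2D grid environment, forcing agents to rely primarily on local sensory input (a k x k view) and local communication.
We propose metrics for coordination effectiveness and analyze emergent group dynamics.
Evaluating several leading LLMs in a zero-shot setting, we find significant performance variation across tasks, highlighting the difficulties posed by local information constraints.
While some coordination emerges, the results indicate limitations in robust planning and strategy formation under uncertainty in these decentralized scenarios.
Assessing LLMs under swarm-like conditions is crucial for realizing their potential in future decentralized systems.
We release SwarmBench as an open, extensible toolkit built upon a customizable and scalable physical system with defined mechanical properties.
It provides environments, prompts, evaluation scripts, and the comprehensive experimental datasets generated, aiming to foster reproducible research into LLM-based MAS coordination and the theoretical underpinnings of embodied MAS.
Our code repository is available at https://github.com/x66ccff/swarmbench.

Abstract (original)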

Large Language Models (LLMs) show potential for complex reasoning, yet their capacity for emergent coordination in Multi-Agent Systems (MAS) when operating under strict constraints (such as limited local perception and communication, characteristic of natural swarms) remains largely unexplored, particularly concerning the nuances of swarm intelligence. Existing benchmarks often do not fully capture the unique challenges of decentralized coordination that arise when agents operate with incomplete spatio-temporal information. To bridge this gap, we introduce SwarmBench, a novel benchmark designed to systematically evaluate the swarm intelligence capabilities of LLMs acting as decentralized agents. SwarmBench features five foundational MAS coordination tasks within a configurable 2D grid environment, forcing agents to rely primarily on local sensory input (k x k view) and local communication. We propose metrics for coordination effectiveness and analyze emergent group dynamics. Evaluating several leading LLMs in a zero-shot setting, we find significant performance variations across tasks, highlighting the difficulties posed by local information constraints. While some coordination emerges, results indicate limitations in robust planning and strategy formation under uncertainty in these decentralized scenarios. Assessing LLMs under swarm-like conditions is crucial for realizing their potential in future decentralized systems. We release SwarmBench as an open, extensible toolkit built upon a customizable and scalable physical system with defined mechanical properties. It provides environments, prompts, evaluation scripts, and the comprehensive experimental datasets generated, aiming to foster reproducible research into LLM-based MAS coordination and the theoretical underpinnings of Embodied MAS. Our code repository is available at https://github.com/x66ccff/swarmbench.
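The local-perception constraint (a k x k view) is easy to picture in code. The sketch below is not taken from the SwarmBench repository; the grid encoding, the `local_view` name, and the `#` padding for out-of-bounds cells are assumptions made for illustration.

```python
def local_view(grid, row, col, k, pad="#"):
    """Return the k x k window centred on (row, col); k must be odd."""
    if k % 2 == 0:
        raise ValueError("k must be odd so the agent sits at the centre")
    half = k // 2
    view = []
    for r in range(row - half, row + half + 1):
        line = []
        for c in range(col - half, col + half + 1):
            inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
            # Cells beyond the grid edge are shown as padding.
            line.append(grid[r][c] if inside else pad)
        view.append("".join(line))
    return view


# Toy 4 x 4 world with two agents, A and B.
grid = [
    "....",
    ".A..",
    "..B.",
    "....",
]

for line in local_view(grid, 1, 1, 3):
    print(line)
```

An agent that receives only this window has to coordinate without seeing anything beyond it.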

arXiv information

Authors	Kai Ruan, Mowen Huang, Ji-Rong Wen, Hao Sun
Published	2025-05-07 12:32:01+00:00

Categories: cs.CL, cs.MA

Playing repeated games with Large Language Models

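Posted: May 8, 2025 | Author: jarxiv

Summary

LLMs are increasingly used in applications where they interact with humans and other agents.
We propose using behavioural game theory to study LLMs' cooperation and coordination behaviour.
We let different LLMs play finitely repeated $2\times2$ games with each other, with human-like strategies, and with actual human players.
Our results show that LLMs perform particularly well in self-interested games such as the iterated Prisoner's Dilemma family.
However, they behave sub-optimally in games that require coordination, such as the Battle of the Sexes.
We verify that these behavioural signatures are stable across robustness checks.
We additionally show how GPT-4's behaviour can be modulated by providing additional information about its opponent and by using a "social chain-of-thought" (SCoT) strategy.
This also leads to better scores and more successful coordination when interacting with human players.
These results enrich our understanding of LLMs' social behaviour and pave the way for a behavioural game theory for machines.

Abstract (original)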

LLMs are increasingly used in applications where they interact with humans and other agents. We propose to use behavioural game theory to study LLM’s cooperation and coordination behaviour. We let different LLMs play finitely repeated $2\times2$ games with each other, with human-like strategies, and actual human players. Our results show that LLMs perform particularly well at self-interested games like the iterated Prisoner’s Dilemma family. However, they behave sub-optimally in games that require coordination, like the Battle of the Sexes. We verify that these behavioural signatures are stable across robustness checks. We additionally show how GPT-4’s behaviour can be modulated by providing additional information about its opponent and by using a ‘social chain-of-thought’ (SCoT) strategy. This also leads to better scores and more successful coordination when interacting with human players. These results enrich our understanding of LLM’s social behaviour and pave the way for a behavioural game theory for machines.
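A finitely repeated $2\times2$ game is simple to simulate with scripted strategies. The sketch below plays an iterated Prisoner's Dilemma between two hand-written strategies; the payoff values, strategy functions, and round count are illustrative assumptions rather than the paper's setup, and no LLM is queried.

```python
# (row move, column move) -> (row payoff, column payoff); values are illustrative.
PAYOFFS = {
    ("C", "C"): (8, 8),
    ("C", "D"): (0, 10),
    ("D", "C"): (10, 0),
    ("D", "D"): (5, 5),
}


def tit_for_tat(own_history, other_history):
    """Cooperate first, then copy the opponent's previous move."""
    return "C" if not other_history else other_history[-1]


def always_defect(own_history, other_history):
    """Defect in every round regardless of history."""
    return "D"


def play(strategy_a, strategy_b, rounds=10):
    """Play a finitely repeated game and return both cumulative scores."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        hist_a.append(move_a)
        hist_b.append(move_b)
        score_a += pay_a
        score_b += pay_b
    return score_a, score_b


print(play(tit_for_tat, always_defect))
print(play(tit_for_tat, tit_for_tat))
```

Replacing one strategy function with a call to a language model that reads the two histories would turn this harness into the kind of experiment the abstract describes.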

arXiv information

Authors	Elif Akata, Lion Schulz, Julian Coda-Forno, Seong Joon Oh, Matthias Bethge, Eric Schulz
Published	2025-05-07 12:44:45+00:00

Categories: cs.CL

Large Means Left: Political Bias in Large Language Models Increases with Their Number of Parameters

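Posted: May 8, 2025 | Author: jarxiv

Summary

With the increasing prevalence of artificial intelligence, inherent biases need to be carefully evaluated to form a basis for alleviating the effects these predispositions can have on users.
Large language models (LLMs) are used by many people as a primary source of information on a wide range of topics.
LLMs frequently make factual errors, fabricate data (hallucinations), or present biases, exposing users to misinformation and influencing their opinions.
Because bias, unlike hallucination, cannot be caught through data verification, educating users about the risks is key to responsible use.
We quantify the political bias of popular LLMs in the context of the recent vote of the German Bundestag using the score produced by the Wahl-O-Mat.
This metric measures the alignment between an individual's political views and the positions of German political parties.
We compare the models' alignment scores to identify factors that influence their political preferences.
In doing so, we discover a bias toward left-leaning parties that is most dominant in larger LLMs.
We also find that the language used to communicate with the models affects their political views.
In addition, we analyze the influence of a model's origin and release date and compare the results with the outcome of the recent Bundestag vote.
Our results imply that LLMs are prone to exhibiting political bias.
Large corporations with the means to develop LLMs therefore have a responsibility, knowingly or unknowingly, to contain these biases, because they can influence each voter's decision-making process and inform public opinion in general and at scale.

Abstract (original)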

With the increasing prevalence of artificial intelligence, careful evaluation of inherent biases needs to be conducted to form the basis for alleviating the effects these predispositions can have on users. Large language models (LLMs) are predominantly used by many as a primary source of information for various topics. LLMs frequently make factual errors, fabricate data (hallucinations), or present biases, exposing users to misinformation and influencing opinions. Educating users on their risks is key to responsible use, as bias, unlike hallucinations, cannot be caught through data verification. We quantify the political bias of popular LLMs in the context of the recent vote of the German Bundestag using the score produced by the Wahl-O-Mat. This metric measures the alignment between an individual’s political views and the positions of German political parties. We compare the models’ alignment scores to identify factors influencing their political preferences. Doing so, we discover a bias toward left-leaning parties, most dominant in larger LLMs. Also, we find that the language we use to communicate with the models affects their political views. Additionally, we analyze the influence of a model’s origin and release date and compare the results to the outcome of the recent vote of the Bundestag. Our results imply that LLMs are prone to exhibiting political bias. Large corporations with the necessary means to develop LLMs, thus, knowingly or unknowingly, have a responsibility to contain these biases, as they can influence each voter’s decision-making process and inform public opinion in general and at scale.

arXiv information

Authors	David Exler, Mark Schutera, Markus Reischl, Luca Rettenberger
Published	2025-05-07 13:18:41+00:00

Categories: cs.CL

Miipher-2: A Universal Speech Restoration Model for Million-Hour Scale Data Restoration

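Posted: May 8, 2025 | Author: jarxiv

Summary

Training data cleaning is a new application of generative model-based speech restoration (SR).
This paper introduces Miipher-2, an SR model designed for million-hour scale data, intended for cleaning the training data of large-scale generative models such as large language models.
The key challenges addressed include generalization to unseen languages, operation without explicit conditioning (e.g., text, speaker ID), and computational efficiency.
Miipher-2 uses a frozen, pre-trained Universal Speech Model (USM), which supports over 300 languages, as a robust, conditioning-free feature extractor.
To optimize efficiency and minimize memory use, Miipher-2 incorporates parallel adapters that predict clean USM features from noisy inputs and employs the WaveFit neural vocoder for waveform synthesis.
These components were trained on 3,000 hours of multilingual, studio-quality recordings with augmented degradations, while the USM parameters remained fixed.
Experimental results show that Miipher-2 performs better than or comparably to conventional SR models in word error rate, speaker similarity, and both objective and subjective sound quality scores across all tested languages.
Miipher-2 runs efficiently on consumer-grade accelerators, achieving a real-time factor of 0.0078, which enables a million-hour speech dataset to be processed in approximately three days using only 100 such accelerators.

Abstract (original)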

Training data cleaning is a new application for generative model-based speech restoration (SR). This paper introduces Miipher-2, an SR model designed for million-hour scale data, for training data cleaning for large-scale generative models like large language models. Key challenges addressed include generalization to unseen languages, operation without explicit conditioning (e.g., text, speaker ID), and computational efficiency. Miipher-2 utilizes a frozen, pre-trained Universal Speech Model (USM), supporting over 300 languages, as a robust, conditioning-free feature extractor. To optimize efficiency and minimize memory, Miipher-2 incorporates parallel adapters for predicting clean USM features from noisy inputs and employs the WaveFit neural vocoder for waveform synthesis. These components were trained on 3,000 hours of multi-lingual, studio-quality recordings with augmented degradations, while USM parameters remained fixed. Experimental results demonstrate Miipher-2’s superior or comparable performance to conventional SR models in word-error-rate, speaker similarity, and both objective and subjective sound quality scores across all tested languages. Miipher-2 operates efficiently on consumer-grade accelerators, achieving a real-time factor of 0.0078, enabling the processing of a million-hour speech dataset in approximately three days using only 100 such accelerators.
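The throughput claim can be checked with a few lines of arithmetic. The sketch below assumes the usual definition of real-time factor (processing time divided by audio duration) and perfect parallelism across accelerators with no I/O overhead; the function name is hypothetical.

```python
def processing_days(audio_hours, real_time_factor, num_accelerators):
    """Wall-clock days needed to process a dataset, assuming ideal scaling."""
    # Total compute time on a single accelerator.
    compute_hours = audio_hours * real_time_factor
    # Split evenly across accelerators.
    wall_clock_hours = compute_hours / num_accelerators
    return wall_clock_hours / 24


days = processing_days(1_000_000, 0.0078, 100)
print(f"{days:.2f} days")
```

One million hours at a real-time factor of 0.0078 is 7,800 accelerator-hours, or 78 hours on 100 accelerators, which is about 3.25 days and consistent with the "approximately three days" stated in the abstract.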

arXiv information

Authors	Shigeki Karita, Yuma Koizumi, Heiga Zen, Haruko Ishikawa, Robin Scheibler, Michiel Bacchiani
Published	2025-05-07 14:27:46+00:00

Categories: cs.CL, cs.SD, eess.AS

OAEI-LLM-T: A TBox Benchmark Dataset for Understanding Large Language Model Hallucinations in Ontology Matching

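Posted: May 8, 2025 | Author: jarxiv

Summary

Hallucinations are often inevitable in downstream tasks that use large language models (LLMs).
To tackle the substantial challenge of addressing hallucinations in LLM-based ontology matching (OM) systems, we introduce a new benchmark dataset called OAEI-LLM-T.
The dataset evolves from the TBox (i.e., schema-matching) datasets of the Ontology Alignment Evaluation Initiative (OAEI) and captures the hallucinations of different LLMs performing OM tasks.
These OM-specific hallucinations are carefully classified into two primary categories and six sub-categories.
We showcase the usefulness of the dataset in constructing an LLM leaderboard and in fine-tuning foundational LLMs for LLM-based OM systems.

Abstract (original)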

Hallucinations are often inevitable in downstream tasks using large language models (LLMs). To tackle the substantial challenge of addressing hallucinations for LLM-based ontology matching (OM) systems, we introduce a new benchmark dataset called OAEI-LLM-T. The dataset evolves from the TBox (i.e. schema-matching) datasets in the Ontology Alignment Evaluation Initiative (OAEI), capturing hallucinations of different LLMs performing OM tasks. These OM-specific hallucinations are carefully classified into two primary categories and six sub-categories. We showcase the usefulness of the dataset in constructing the LLM leaderboard and fine-tuning foundational LLMs for LLM-based OM systems.

arXiv information

Authors	Zhangcheng Qiang
Published	2025-05-07 15:02:02+00:00

Categories: cs.CL, cs.IR

Automated Coding of Communications in Collaborative Problem-solving Tasks Using ChatGPT

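Posted: May 8, 2025 | Author: jarxiv

Summary

Collaborative problem solving (CPS) is widely recognized as a critical 21st-century skill.
Assessing CPS depends heavily on coding the communication data with a construct-relevant framework, and this process has long been a major bottleneck to scaling up such assessments.
Based on five datasets and two coding frameworks, we demonstrate that ChatGPT can code communication data to a satisfactory level, although performance varies across ChatGPT models and depends on the coding framework and task characteristics.
Interestingly, newer reasoning-focused models such as GPT-o1-mini and GPT-o3-mini do not necessarily yield better coding results.
In addition, we show that refining prompts based on feedback from miscoded cases can improve coding accuracy in some instances, although the effectiveness of this approach is not consistent across all tasks.
These findings offer practical guidance for researchers and practitioners developing scalable, efficient methods for analyzing communication data in support of 21st-century skill assessment.

Abstract (original)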

Collaborative problem solving (CPS) is widely recognized as a critical 21st-century skill. Assessing CPS depends heavily on coding the communication data using a construct-relevant framework, and this process has long been a major bottleneck to scaling up such assessments. Based on five datasets and two coding frameworks, we demonstrate that ChatGPT can code communication data to a satisfactory level, though performance varies across ChatGPT models, and depends on the coding framework and task characteristics. Interestingly, newer reasoning-focused models such as GPT-o1-mini and GPT-o3-mini do not necessarily yield better coding results. Additionally, we show that refining prompts based on feedback from miscoded cases can improve coding accuracy in some instances, though the effectiveness of this approach is not consistent across all tasks. These findings offer practical guidance for researchers and practitioners in developing scalable, efficient methods to analyze communication data in support of 21st-century skill assessment.

arXiv information

Authors	Jiangang Hao, Wenju Cui, Patrick Kyllonen, Emily Kerzabi, Lei Liu, Michael Flor
Published	2025-05-07 15:14:10+00:00

Categories: cs.CL, cs.HC
