jarxiv | Japanese arxiv | Page 1536

Predicting clinical outcomes from patient care pathways represented with temporal knowledge graphs

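Posted: March 3, 2025 | Author: jarxiv

Summary

Background: As healthcare data become more available, predictive modeling is finding many applications in the biomedical domain, such as assessing the level of risk for various conditions, which can in turn guide clinical decision making.
However, it remains unclear how knowledge graph data representations and their embeddings, which are competitive in some settings, could benefit biomedical predictive modeling.
Method: We simulated synthetic but realistic data for patients with intracranial aneurysm and experimented on the task of predicting their clinical outcome.
We compared the performance of various classification approaches on tabular data against a graph-based representation of the same data.
Next, we investigated how the schema adopted to represent, first, individual data and, second, temporal data affects predictive performance.
Results: Our study shows that, in our case, a graph representation with Graph Convolutional Network (GCN) embeddings reaches the best performance on a predictive task from observational data.
We emphasize the importance of the adopted schema and of taking literal values into account when representing individual data.
Our study also tempers the relative impact of the various time encodings on GCN performance.

Summary (original)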

Background: With the increasing availability of healthcare data, predictive modeling finds many applications in the biomedical domain, such as the evaluation of the level of risk for various conditions, which in turn can guide clinical decision making. However, it is unclear how knowledge graph data representations and their embedding, which are competitive in some settings, could be of interest in biomedical predictive modeling. Method: We simulated synthetic but realistic data of patients with intracranial aneurysm and experimented on the task of predicting their clinical outcome. We compared the performance of various classification approaches on tabular data versus a graph-based representation of the same data. Next, we investigated how the adopted schema for representing first individual data and second temporal data impacts predictive performances. Results: Our study illustrates that in our case, a graph representation and Graph Convolutional Network (GCN) embeddings reach the best performance for a predictive task from observational data. We emphasize the importance of the adopted schema and of the consideration of literal values in the representation of individual data. Our study also moderates the relative impact of various time encoding on GCN performance.

arXiv information

Authors	Jong Ho Jhee, Alberto Megina, Pacôme Constant Dit Beaufils, Matilde Karakachoff, Richard Redon, Alban Gaignard, Adrien Coulet
Published	2025-02-28 15:20:41+00:00

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.LG

Multimodal Dreaming: A Global Workspace Approach to World Model-Based Reinforcement Learning

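Posted: March 3, 2025 | Author: jarxiv

Summary

Humans leverage rich internal models of the world to reason about the future, imagine counterfactuals, and adapt flexibly to new situations.
In Reinforcement Learning (RL), world models aim to capture how the environment evolves in response to the agent's actions, facilitating planning and generalization.
However, typical world models operate directly on environment variables (e.g., pixels, physical attributes), which can make training slow and cumbersome.
Instead, it may be advantageous to rely on high-level latent dimensions that capture the relevant multimodal variables.
Global Workspace (GW) Theory offers a cognitive framework for multimodal integration and information broadcasting in the brain, and recent studies have begun to introduce efficient deep learning implementations of GW.
Here, we evaluate the capabilities of an RL system that combines GW with a world model.
We compare GW-Dreamer with various versions of the standard PPO and the original Dreamer algorithms.
We show that performing the dreaming process (i.e., mental simulation) inside the GW latent space allows training with fewer environment steps.
As an additional emergent property, the resulting model (but not its comparison baselines) is strongly robust to the absence of one of its observation modalities (images or simulation attributes).
We conclude that combining GW with world models holds great potential for improving decision-making in RL agents.

Summary (original)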

Humans leverage rich internal models of the world to reason about the future, imagine counterfactuals, and adapt flexibly to new situations. In Reinforcement Learning (RL), world models aim to capture how the environment evolves in response to the agent’s actions, facilitating planning and generalization. However, typical world models directly operate on the environment variables (e.g. pixels, physical attributes), which can make their training slow and cumbersome; instead, it may be advantageous to rely on high-level latent dimensions that capture relevant multimodal variables. Global Workspace (GW) Theory offers a cognitive framework for multimodal integration and information broadcasting in the brain, and recent studies have begun to introduce efficient deep learning implementations of GW. Here, we evaluate the capabilities of an RL system combining GW with a world model. We compare our GW-Dreamer with various versions of the standard PPO and the original Dreamer algorithms. We show that performing the dreaming process (i.e., mental simulation) inside the GW latent space allows for training with fewer environment steps. As an additional emergent property, the resulting model (but not its comparison baselines) displays strong robustness to the absence of one of its observation modalities (images or simulation attributes). We conclude that the combination of GW with World Models holds great potential for improving decision-making in RL agents.

arXiv information

Authors	Léopold Maytié, Roland Bertin Johannet, Rufin VanRullen
Published	2025-02-28 15:24:17+00:00

Categories: cs.AI, cs.LG, q-bio.NC

Grams: Gradient Descent with Adaptive Momentum Scaling for Training Large Language Models

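Posted: March 3, 2025 | Author: jarxiv

Summary

We introduce Gradient Descent with Adaptive Momentum Scaling (Grams), a novel optimization algorithm that decouples the direction and magnitude of parameter updates in deep learning.
Unlike traditional optimizers that integrate momentum directly into updates, Grams separates the update direction, derived from the current gradients, from momentum, which is used solely for adaptive magnitude scaling.
This approach enables Grams to achieve improved loss descent compared with state-of-the-art cautious and momentum-based optimizers.
Theoretically, we demonstrate that Grams descends faster than other state-of-the-art optimizers and establish a global convergence guarantee for Grams.
We also validate its effectiveness through extensive empirical evaluations.
The results demonstrate the superior performance of Grams, including faster convergence and better generalization, compared with widely used optimizers such as Adam, Lion, and their cautious variants.
Our results highlight the potential of Grams as a transformative approach for efficiently training large language models.
Code is available at https://github.com/Gunale0926/Grams.

Summary (original)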

We introduce Gradient Descent with Adaptive Momentum Scaling (Grams), a novel optimization algorithm that decouples the direction and magnitude of parameter updates in deep learning. Unlike traditional optimizers that directly integrate momentum into updates, Grams separates the update direction, derived from current gradients, from momentum, which is used solely for adaptive magnitude scaling. This approach enables Grams to achieve improved loss descent compared to state-of-the-art cautious and momentum-based optimizers. We theoretically demonstrate that Grams descends faster than other state-of-the-art optimizers and establish a global convergence guarantee for Grams. We also validate its effectiveness through extensive empirical evaluations. The results demonstrate Grams' superior performance, including faster convergence and better generalization, compared to widely-used optimizers such as Adam, Lion, and their cautious variants. Our results highlight Grams' potential as a transformative approach for efficiently training large language models. Code is available at https://github.com/Gunale0926/Grams.
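The decoupling of direction and magnitude described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' released code: it assumes that the magnitude is the absolute value of an Adam-style update and that the direction is the sign of the current gradient. The names `grams_step` and `minimize_quadratic`, the hyperparameter defaults, and the toy quadratic are all assumptions introduced here.

```python
import math


def grams_step(params, grads, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative Grams-style update on flat lists of floats.

    `m` and `v` are the first and second moment buffers; they are updated in place.
    `step` is the 1-based iteration count used for bias correction.
    """
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        # Adam-style moment estimates (assumed here as the source of the magnitude).
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        m_hat = m[i] / (1 - beta1 ** step)
        v_hat = v[i] / (1 - beta2 ** step)
        adam_update = m_hat / (math.sqrt(v_hat) + eps)
        # Direction: sign of the current gradient only (assumption).
        direction = (g > 0) - (g < 0)
        # Magnitude: absolute value of the momentum-based update.
        new_params.append(p - lr * direction * abs(adam_update))
    return new_params


def minimize_quadratic(x0, steps=200, lr=0.1):
    """Minimize f(x) = x^2 (gradient 2x) with grams_step and return the final x."""
    params, m, v = [x0], [0.0], [0.0]
    for step in range(1, steps + 1):
        grads = [2.0 * params[0]]
        params = grams_step(params, grads, m, v, step, lr=lr)
    return params[0]
```

In this sketch, a parameter whose accumulated momentum still points one way while the current gradient points the other moves with the current gradient, which is the behavior the abstract contrasts with optimizers that fold momentum directly into the update.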

arXiv information

Authors	Yang Cao, Xiaoyu Li, Zhao Song
Published	2025-02-28 15:31:39+00:00

Categories: cs.AI, cs.DS, cs.LG, math.OC

LLMs in the Heart of Differential Testing: A Case Study on a Medical Rule Engine

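Posted: March 3, 2025 | Author: jarxiv

Summary

The Cancer Registry of Norway (CRN) uses an automated cancer registration support system (CaReSS) to support core cancer registry activities, namely data capture, data curation, and producing data products and statistics for various stakeholders.
GURI is a core component of CaReSS and is responsible for validating incoming data with medical rules.
Such medical rules are manually implemented by medical experts based on medical standards, regulations, and research.
Because large language models (LLMs) have been trained on a large amount of public information, including these documents, they can be used to generate tests for GURI.
We therefore propose an LLM-based test generation and differential testing approach (LLMeDiff) to test GURI.
We experimented with four different LLMs, two medical rule engine implementations, and 58 real medical rules to investigate the hallucination, success, time efficiency, and robustness of the LLMs in generating tests, and the ability of these tests to find potential issues in GURI.
Our results showed that GPT-3.5 hallucinates the least, is the most successful, and is generally the most robust.
However, it has the worst time efficiency.
Our differential testing revealed 22 medical rules for which implementation inconsistencies were discovered (e.g., regarding the handling of rule versions).
Finally, we provide insights for practitioners and researchers based on the results.

Summary (original)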

The Cancer Registry of Norway (CRN) uses an automated cancer registration support system (CaReSS) to support core cancer registry activities, i.e., data capture, data curation, and producing data products and statistics for various stakeholders. GURI is a core component of CaReSS, which is responsible for validating incoming data with medical rules. Such medical rules are manually implemented by medical experts based on medical standards, regulations, and research. Since large language models (LLMs) have been trained on a large amount of public information, including these documents, they can be employed to generate tests for GURI. Thus, we propose an LLM-based test generation and differential testing approach (LLMeDiff) to test GURI. We experimented with four different LLMs, two medical rule engine implementations, and 58 real medical rules to investigate the hallucination, success, time efficiency, and robustness of the LLMs to generate tests, and these tests' ability to find potential issues in GURI. Our results showed that GPT-3.5 hallucinates the least, is the most successful, and is generally the most robust; however, it has the worst time efficiency. Our differential testing revealed 22 medical rules where implementation inconsistencies were discovered (e.g., regarding handling rule versions). Finally, we provide insights for practitioners and researchers based on the results.
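Differential testing, as used in this abstract, runs the same test inputs through two implementations of a rule and flags every input on which they disagree. The sketch below shows only that comparison step. The two rule functions, the `age` field, and the hand-written inputs are hypothetical stand-ins invented for illustration: they are not GURI rules, and in the paper the tests are generated by LLMs rather than written by hand.

```python
def differential_test(impl_a, impl_b, test_inputs):
    """Return (input, output_a, output_b) for every input where the outputs differ."""
    mismatches = []
    for record in test_inputs:
        out_a, out_b = impl_a(record), impl_b(record)
        if out_a != out_b:
            mismatches.append((record, out_a, out_b))
    return mismatches


# Hypothetical rule "age must be plausible", implemented twice with
# different boundary handling (inclusive vs. exclusive limits).
def rule_engine_a(record):
    return 0 <= record["age"] <= 120


def rule_engine_b(record):
    return 0 < record["age"] < 120


# Hand-written inputs standing in for LLM-generated tests.
tests = [{"age": a} for a in (-1, 0, 50, 120, 121)]
inconsistent = differential_test(rule_engine_a, rule_engine_b, tests)
inconsistent_ages = [record["age"] for record, _, _ in inconsistent]
print(inconsistent_ages)
```

Neither implementation is assumed to be correct. A mismatch only marks a rule where the two engines should be inspected, which matches how the abstract reports inconsistencies rather than confirmed bugs.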

arXiv information

Authors	Erblin Isaku, Christoph Laaber, Hassan Sartaj, Shaukat Ali, Thomas Schwitalla, Jan F. Nygård
Published	2025-02-28 15:33:10+00:00

Categories: cs.AI, cs.SE

Disentangling Uncertainty for Safe Social Navigation using Deep Reinforcement Learning

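Posted: March 3, 2025 | Author: jarxiv

Summary

Autonomous mobile robots are increasingly used in pedestrian-rich environments where safe navigation and appropriate human interaction are crucial.
While Deep Reinforcement Learning (DRL) enables socially integrated robot behavior, it remains a challenge in novel or perturbed scenarios to indicate when and why the policy is uncertain.
Unknown uncertainty in decision-making can lead to collisions or human discomfort and is one reason why safe and risk-aware navigation is still an open problem.
This work introduces a novel approach that integrates aleatoric, epistemic, and predictive uncertainty estimation into a DRL navigation framework to estimate the uncertainty of the policy distribution.
To this end, we incorporate Observation-Dependent Variance (ODV) and dropout into the Proximal Policy Optimization (PPO) algorithm.
For different types of perturbations, we compare the ability of deep ensembles and Monte-Carlo dropout (MC-dropout) to estimate the uncertainties of the policy.
In uncertain decision-making situations, we propose changing the robot's social behavior to conservative collision avoidance.
The results show improved training performance with ODV and dropout in PPO and reveal that the training scenario affects generalization.
In addition, MC-dropout is more sensitive to perturbations and better correlates the uncertainty type with the perturbation.
With safe action selection, the robot can navigate perturbed environments with fewer collisions.

Summary (original)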

Autonomous mobile robots are increasingly used in pedestrian-rich environments where safe navigation and appropriate human interaction are crucial. While Deep Reinforcement Learning (DRL) enables socially integrated robot behavior, challenges persist in novel or perturbed scenarios to indicate when and why the policy is uncertain. Unknown uncertainty in decision-making can lead to collisions or human discomfort and is one reason why safe and risk-aware navigation is still an open problem. This work introduces a novel approach that integrates aleatoric, epistemic, and predictive uncertainty estimation into a DRL navigation framework for policy distribution uncertainty estimates. We, therefore, incorporate Observation-Dependent Variance (ODV) and dropout into the Proximal Policy Optimization (PPO) algorithm. For different types of perturbations, we compare the ability of deep ensembles and Monte-Carlo dropout (MC-dropout) to estimate the uncertainties of the policy. In uncertain decision-making situations, we propose to change the robot’s social behavior to conservative collision avoidance. The results show improved training performance with ODV and dropout in PPO and reveal that the training scenario has an impact on the generalization. In addition, MC-dropout is more sensitive to perturbations and correlates the uncertainty type to the perturbation better. With the safe action selection, the robot can navigate in perturbed environments with fewer collisions.
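One common way to separate the three uncertainty types named in this abstract is the law of total variance over several stochastic forward passes (for example, MC-dropout samples or ensemble members). The sketch below assumes each pass returns a predicted mean and variance for a scalar action. It is a generic decomposition, not necessarily the estimator used in the paper; the function names and the threshold-based fallback are assumptions introduced here.

```python
import statistics


def decompose_uncertainty(samples):
    """Split predictive variance into aleatoric and epistemic parts.

    `samples` is a list of (mean, variance) pairs, one per stochastic
    forward pass of the policy.
    """
    means = [mu for mu, _ in samples]
    variances = [var for _, var in samples]
    aleatoric = statistics.fmean(variances)    # average predicted noise
    epistemic = statistics.pvariance(means)    # disagreement between passes
    return {
        "aleatoric": aleatoric,
        "epistemic": epistemic,
        "predictive": aleatoric + epistemic,
    }


def select_action(policy_action, safe_action, uncertainty, threshold):
    """Fall back to the conservative action when predictive uncertainty is too high."""
    if uncertainty["predictive"] > threshold:
        return safe_action
    return policy_action
```

For example, `decompose_uncertainty([(1.0, 0.2), (3.0, 0.4)])` yields a large epistemic term because the two passes disagree on the mean, and `select_action` would then return the safe action for any threshold below the predictive total.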

arXiv information

Authors	Daniel Flögel, Marcos Gómez Villafañe, Joshua Ransiek, Sören Hohmann
Published	2025-02-28 15:38:12+00:00

Categories: cs.AI, cs.RO, cs.SY, eess.SY

From Commands to Prompts: LLM-based Semantic File System for AIOS

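Posted: March 3, 2025 | Author: jarxiv

Summary

Large language models (LLMs) have demonstrated significant potential in the development of intelligent applications and systems such as LLM-based agents and agent operating systems (AIOS).
However, when these applications and systems interact with the underlying file system, the file system still follows the traditional paradigm, which relies on manual navigation through precise commands.
This paradigm is a bottleneck for the usability of these systems, because users must navigate complex folder hierarchies and remember cryptic file names.
To address this limitation, we propose an LLM-based semantic file system (LSFS) for prompt-driven file management.
Unlike conventional approaches, LSFS incorporates LLMs so that users or agents can interact with files through natural language prompts, facilitating semantic file management.
At the macro level, we develop a comprehensive API set to provide semantic file management functionality, such as semantic file retrieval, file update monitoring and summarization, and semantic file rollback.
At the micro level, we store files by constructing semantic indexes for them, and we design and implement syscalls for different semantic operations (e.g., CRUD, group by, join) powered by a vector database.
Our experiments show that LSFS offers significant improvements over traditional file systems in terms of user convenience, the diversity of supported functions, and the accuracy and efficiency of file operations.
In addition, with the integration of LLMs, our system enables more intelligent file management tasks, such as content summarization and version comparison, further enhancing its capabilities.

Summary (original)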

Large language models (LLMs) have demonstrated significant potential in the development of intelligent applications and systems such as LLM-based agents and agent operating systems (AIOS). However, when these applications and systems interact with the underlying file system, the file system still remains the traditional paradigm: reliant on manual navigation through precise commands. This paradigm poses a bottleneck to the usability of these systems as users are required to navigate complex folder hierarchies and remember cryptic file names. To address this limitation, we propose an LLM-based semantic file system (LSFS) for prompt-driven file management. Unlike conventional approaches, LSFS incorporates LLMs to enable users or agents to interact with files through natural language prompts, facilitating semantic file management. At the macro-level, we develop a comprehensive API set to achieve semantic file management functionalities, such as semantic file retrieval, file update monitoring and summarization, and semantic file rollback. At the micro-level, we store files by constructing semantic indexes for them, design and implement syscalls of different semantic operations (e.g., CRUD, group by, join) powered by a vector database. Our experiments show that LSFS offers significant improvements over traditional file systems in terms of user convenience, the diversity of supported functions, and the accuracy and efficiency of file operations. Additionally, with the integration of LLM, our system enables more intelligent file management tasks, such as content summarization and version comparison, further enhancing its capabilities.
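The idea of retrieving files by meaning rather than by path can be illustrated with a toy semantic index. The sketch below is not the LSFS API: the `SemanticIndex` class, its methods, and the file paths are hypothetical, and a bag-of-words vector with cosine similarity stands in for the LLM embeddings and vector database mentioned in the abstract.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for an LLM embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


class SemanticIndex:
    """Hypothetical in-memory index mapping file paths to content vectors."""

    def __init__(self):
        self._vectors = {}

    def add(self, path, content):
        self._vectors[path] = embed(content)

    def retrieve(self, prompt, k=1):
        """Return the k paths whose content is most similar to the prompt."""
        query = embed(prompt)
        ranked = sorted(
            self._vectors,
            key=lambda p: cosine(query, self._vectors[p]),
            reverse=True,
        )
        return ranked[:k]


index = SemanticIndex()
index.add("notes/a.txt", "quarterly budget report for finance")
index.add("notes/b.txt", "holiday photos from the beach trip")
print(index.retrieve("budget report"))
```

A caller asks for "budget report" and gets back a path without knowing the folder hierarchy or file name, which is the usability gain the abstract attributes to prompt-driven file management.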

arXiv information

Authors	Zeru Shi, Kai Mei, Mingyu Jin, Yongye Su, Chaoji Zuo, Wenyue Hua, Wujiang Xu, Yujie Ren, Zirui Liu, Mengnan Du, Dong Deng, Yongfeng Zhang
Published	2025-02-28 15:41:00+00:00

Categories: cs.AI, cs.DB, cs.HC, cs.LG

Graph Sampling for Scalable and Expressive Graph Neural Networks on Homophilic Graphs

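Posted: March 3, 2025 | Author: jarxiv

Summary

Graph Neural Networks (GNNs) excel in many graph machine learning tasks but face challenges when scaling to large networks.
GNN transferability allows training on smaller graphs and applying the model to larger ones, but existing methods often rely on random subsampling, which leads to disconnected subgraphs and reduced model expressivity.
We propose a novel graph sampling algorithm that leverages feature homophily to preserve graph structure.
By minimizing the trace of the data correlation matrix, our method preserves the graph Laplacian trace (a proxy for graph connectivity) better than random sampling, while achieving lower complexity than spectral methods.
Experiments on citation networks show improved performance in preserving the Laplacian trace and GNN transferability compared with random sampling.

Summary (original)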

Graph Neural Networks (GNNs) excel in many graph machine learning tasks but face challenges when scaling to large networks. GNN transferability allows training on smaller graphs and applying the model to larger ones, but existing methods often rely on random subsampling, leading to disconnected subgraphs and reduced model expressivity. We propose a novel graph sampling algorithm that leverages feature homophily to preserve graph structure. By minimizing the trace of the data correlation matrix, our method better preserves the graph Laplacian trace — a proxy for the graph connectivity — than random sampling, while achieving lower complexity than spectral methods. Experiments on citation networks show improved performance in preserving Laplacian trace and GNN transferability compared to random sampling.
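The abstract uses the graph Laplacian trace as a proxy for connectivity. For an undirected graph without self-loops, the trace of L = D - A equals the sum of node degrees, that is, twice the number of edges, so a sampled node set that keeps more edges keeps a larger trace. The sketch below only illustrates this proxy on a small path graph. It is not the authors' sampling algorithm, and the function name and example node sets are assumptions.

```python
def laplacian_trace(nodes, edges):
    """Trace of L = D - A for the subgraph induced by `nodes`.

    Assumes an undirected graph given as a list of distinct (u, v) pairs
    with no self-loops, so the trace is twice the number of induced edges.
    """
    kept = set(nodes)
    induced = [(u, v) for u, v in edges if u in kept and v in kept]
    return 2 * len(induced)


# Path graph 0 - 1 - 2 - 3 - 4 - 5.
path_edges = [(i, i + 1) for i in range(5)]

contiguous_sample = [0, 1, 2]   # keeps two edges, stays connected
scattered_sample = [0, 2, 4]    # keeps no edges, fully disconnected

print(laplacian_trace(contiguous_sample, path_edges))
print(laplacian_trace(scattered_sample, path_edges))
```

Both samples contain three nodes, yet only the first preserves any structure, which is the failure mode of random subsampling that the abstract describes.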

arXiv information

Authors	Haolin Li, Haoyu Wang, Luana Ruiz
Published	2025-02-28 15:50:49+00:00

Categories: cs.AI, cs.LG, eess.SP

A Survey of Link Prediction in Temporal Networks

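Posted: March 3, 2025 | Author: jarxiv

Summary

Temporal networks have gained significant prominence over the past decade for modelling dynamic interactions within complex systems.
A key challenge in this domain is Temporal Link Prediction (TLP), which aims to forecast future connections by analysing historical network structures across various applications, including social network analysis.
While existing surveys have addressed specific aspects of TLP, they typically lack a comprehensive framework that distinguishes between representation and inference methods.
This survey bridges this gap by introducing a novel taxonomy that explicitly examines representation and inference in existing methods, providing a novel classification of approaches to TLP.
We analyse how different representation techniques capture temporal and structural dynamics, and examine their compatibility with various inference methods for both transductive and inductive prediction tasks.
Our taxonomy not only clarifies the methodological landscape but also reveals promising unexplored combinations of existing techniques.
This taxonomy provides a systematic foundation for emerging challenges in TLP, including model explainability and scalable architectures for complex temporal networks.

Summary (original)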

Temporal networks have gained significant prominence in the past decade for modelling dynamic interactions within complex systems. A key challenge in this domain is Temporal Link Prediction (TLP), which aims to forecast future connections by analysing historical network structures across various applications including social network analysis. While existing surveys have addressed specific aspects of TLP, they typically lack a comprehensive framework that distinguishes between representation and inference methods. This survey bridges this gap by introducing a novel taxonomy that explicitly examines representation and inference from existing methods, providing a novel classification of approaches for TLP. We analyse how different representation techniques capture temporal and structural dynamics, examining their compatibility with various inference methods for both transductive and inductive prediction tasks. Our taxonomy not only clarifies the methodological landscape but also reveals promising unexplored combinations of existing techniques. This taxonomy provides a systematic foundation for emerging challenges in TLP, including model explainability and scalable architectures for complex temporal networks.

arXiv information

Authors	Jiafeng Xiong, Ahmad Zareie, Rizos Sakellariou
Published	2025-02-28 16:00:57+00:00

Categories: cs.AI, cs.SI

Scalable Decision-Making in Stochastic Environments through Learned Temporal Abstraction

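Posted: March 3, 2025 | Author: jarxiv

Summary

Sequential decision-making in high-dimensional continuous action spaces, particularly in stochastic environments, faces significant computational challenges.
We explore this challenge in the traditional offline RL setting, where an agent must learn how to make decisions based on data collected through a stochastic behavior policy.
We present Latent Macro Action Planner (L-MAP), which addresses this challenge by learning a set of temporally extended macro-actions through a state-conditional Vector Quantized Variational Autoencoder (VQ-VAE), effectively reducing action dimensionality.
L-MAP employs a (separate) learned prior model that acts as a latent transition model and allows efficient sampling of plausible actions.
During planning, our approach accounts for stochasticity in both the environment and the behavior policy by using Monte Carlo tree search (MCTS).
In offline RL settings, including stochastic continuous control tasks, L-MAP efficiently searches over discrete latent actions to yield high expected returns.
Empirical results demonstrate that L-MAP maintains low decision latency despite increased action dimensionality.
Notably, across tasks ranging from continuous control with inherently stochastic dynamics to high-dimensional robotic hand manipulation, L-MAP significantly outperforms existing model-based methods and performs on par with strong model-free actor-critic baselines, highlighting the effectiveness of the proposed approach for planning in complex and stochastic environments with high-dimensional action spaces.

Summary (original)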

Sequential decision-making in high-dimensional continuous action spaces, particularly in stochastic environments, faces significant computational challenges. We explore this challenge in the traditional offline RL setting, where an agent must learn how to make decisions based on data collected through a stochastic behavior policy. We present Latent Macro Action Planner (L-MAP), which addresses this challenge by learning a set of temporally extended macro-actions through a state-conditional Vector Quantized Variational Autoencoder (VQ-VAE), effectively reducing action dimensionality. L-MAP employs a (separate) learned prior model that acts as a latent transition model and allows efficient sampling of plausible actions. During planning, our approach accounts for stochasticity in both the environment and the behavior policy by using Monte Carlo tree search (MCTS). In offline RL settings, including stochastic continuous control tasks, L-MAP efficiently searches over discrete latent actions to yield high expected returns. Empirical results demonstrate that L-MAP maintains low decision latency despite increased action dimensionality. Notably, across tasks ranging from continuous control with inherently stochastic dynamics to high-dimensional robotic hand manipulation, L-MAP significantly outperforms existing model-based methods and performs on par with strong model-free actor-critic baselines, highlighting the effectiveness of the proposed approach in planning in complex and stochastic environments with high-dimensional action spaces.
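The reduction from continuous actions to a discrete set of latent codes rests on the vector quantization step of a VQ-VAE: an encoder output is replaced by its nearest codebook entry, and only the entry's index needs to be searched over during planning. The sketch below shows that generic nearest-neighbor lookup. The state-conditional encoder, the learned codebook, and the MCTS planner from the paper are not reproduced, and the codebook values are made up for illustration.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry closest to `vector`.

    Distance is squared Euclidean; ties resolve to the lowest index.
    """
    def sq_dist(code):
        return sum((x - c) ** 2 for x, c in zip(vector, code))

    index = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    return index, codebook[index]


# Made-up 2-D codebook with three entries (each would encode one macro-action).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
index, code = quantize([0.9, 1.2], codebook)
print(index, code)
```

With K codebook entries, a planner chooses among K discrete indices per macro-action instead of searching a continuous action space, which is why the abstract can speak of searching over discrete latent actions.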

arXiv information

Authors	Baiting Luo, Ava Pettet, Aron Laszka, Abhishek Dubey, Ayan Mukhopadhyay
Published	2025-02-28 16:02:23+00:00

Categories: cs.AI, cs.LG, cs.RO

The BrowserGym Ecosystem for Web Agent Research

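Posted: March 3, 2025 | Author: jarxiv

Summary

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs).
Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it difficult to achieve reliable comparisons and reproducible results.
In an earlier work, Drouin et al. (2024) introduced BrowserGym, which aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks.
We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature and includes AgentLab, a complementary framework that aids in agent creation, testing, and analysis.
The proposed ecosystem offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management.
As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks made available in BrowserGym.
Among other findings, our results highlight a large discrepancy between the latest models from OpenAI and Anthropic, with Claude-3.5-Sonnet leading on almost all benchmarks, except on vision-related tasks, where GPT-4o is superior.
Despite these advances, our results emphasize that building robust and efficient web agents remains a significant challenge, owing to the inherent complexity of real-world web environments and the limitations of current models.

Summary (original)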

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs). Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. In an earlier work, Drouin et al. (2024) introduced BrowserGym, which aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature and includes AgentLab, a complementary framework that aids in agent creation, testing, and analysis. Our proposed ecosystem offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks made available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI and Anthropic's latest models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.

arXiv information

Authors	Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, Alexandre Lacoste
Published	2025-02-28 16:02:27+00:00

Categories: cs.AI, cs.LG, cs.SE
