jarxiv | Japanese arxiv | Page 504

Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists?

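Posted: May 15, 2025 | Author: jarxiv

Summary

Language model (LM) agents are increasingly used as autonomous decision-makers that must actively gather information to guide their decisions.
A key cognitive skill for such agents is efficiently exploring and understanding the causal structure of the world, which underpins robust, scientifically grounded reasoning.
However, it remains unclear whether LMs have this ability or instead exhibit systematic biases that lead to wrong conclusions.
In this work, we use the well-established "Blicket Test" paradigm from developmental psychology to examine how well LMs explore and infer causal relationships.
We find that LMs reliably infer the common, intuitive disjunctive causal relationships but systematically struggle with the unusual conjunctive ones, even when these are equally (or sometimes better) supported by the evidence.
This "disjunctive bias" persists across model families, sizes, and prompting strategies, and performance degrades further as task complexity increases.
Interestingly, a similar bias appears in human adults, suggesting that LMs may have inherited deep-seated reasoning heuristics from their training data.
To this end, we quantify the similarity between LMs and humans and find that LMs show adult-like, but not child-like, inference profiles.
Finally, we propose a test-time sampling method that explicitly samples hypotheses about causal relationships from the LM and eliminates them.
This scalable approach significantly reduces the disjunctive bias and moves LMs closer to the goal of scientific, causally rigorous reasoning.

Abstract (original)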

Language model (LM) agents are increasingly used as autonomous decision-makers who need to actively gather information to guide their decisions. A crucial cognitive skill for such agents is the efficient exploration and understanding of the causal structure of the world — key to robust, scientifically grounded reasoning. Yet, it remains unclear whether LMs possess this capability or exhibit systematic biases leading to erroneous conclusions. In this work, we examine LMs’ ability to explore and infer causal relationships, using the well-established ‘Blicket Test’ paradigm from developmental psychology. We find that LMs reliably infer the common, intuitive disjunctive causal relationships but systematically struggle with the unusual, yet equally (or sometimes even more) evidenced conjunctive ones. This ‘disjunctive bias’ persists across model families, sizes, and prompting strategies, and performance further declines as task complexity increases. Interestingly, an analogous bias appears in human adults, suggesting that LMs may have inherited deep-seated reasoning heuristics from their training data. To this end, we quantify similarities between LMs and humans, finding that LMs exhibit adult-like inference profiles (but not children-like). Finally, we propose a test-time sampling method which explicitly samples and eliminates hypotheses about causal relationships from the LM. This scalable approach significantly reduces the disjunctive bias and moves LMs closer to the goal of scientific, causally rigorous reasoning.
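The hypothesis elimination idea mentioned in the abstract can be illustrated with a toy Blicket setup. The sketch below is only an illustration under stated assumptions: it enumerates every hypothesis exhaustively instead of sampling them from an LM, it assumes a conjunctive machine needs at least two blickets, and the names `predict`, `all_hypotheses`, and `eliminate` are hypothetical rather than taken from the paper.

```python
from itertools import combinations


def predict(rule, blickets, placed):
    """Return True if the machine should light up under this hypothesis."""
    n_active = len(blickets & placed)
    if rule == "disjunctive":
        # Assumption: any single blicket is enough to activate the machine.
        return n_active >= 1
    # Assumption: a conjunctive machine needs at least two blickets together.
    return n_active >= 2


def all_hypotheses(objects):
    """Enumerate every (rule, blicket set) pair; a stand-in for LM sampling."""
    for rule in ("disjunctive", "conjunctive"):
        for k in range(len(objects) + 1):
            for subset in combinations(objects, k):
                yield rule, frozenset(subset)


def eliminate(hypotheses, observations):
    """Keep only the hypotheses that agree with every observation."""
    return [
        (rule, blickets)
        for rule, blickets in hypotheses
        if all(
            predict(rule, blickets, frozenset(placed)) == lit
            for placed, lit in observations
        )
    ]


# Hypothetical evidence: only A and B together switch the machine on.
observations = [
    ({"A"}, False),
    ({"B"}, False),
    ({"C"}, False),
    ({"A", "B"}, True),
    ({"A", "C"}, False),
    ({"B", "C"}, False),
]

surviving = eliminate(all_hypotheses("ABC"), observations)
print(surviving)
```

With this evidence every disjunctive hypothesis is ruled out, and the only survivor is the conjunctive hypothesis in which A and B are the blickets.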

arXiv information

Authors	Anthony GX-Chen, Dongyan Lin, Mandana Samiei, Doina Precup, Blake A. Richards, Rob Fergus, Kenneth Marino
Published	2025-05-14 17:59:35+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.CL

BioVFM-21M: Benchmarking and Scaling Self-Supervised Vision Foundation Models for Biomedical Image Analysis

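Posted: May 15, 2025 | Author: jarxiv

Summary

Scaling up model and data size has demonstrated impressive performance improvements across a wide range of tasks.
Despite extensive studies of scaling behavior on general-purpose tasks, medical images differ substantially from natural data.
Because scaling behavior in the medical domain is not well understood, the key factors in developing medical vision foundation models at scale remain unclear.
In this paper, we investigate scaling behavior across model sizes, training algorithms, data sizes, and imaging modalities when developing scalable medical vision foundation models with self-supervised learning.
To support scalable pretraining, we introduce BioVFM-21M, a large-scale biomedical image dataset covering a wide range of biomedical image modalities and anatomies.
We observe that scaling up does provide benefits, but the benefits vary across tasks.
Additional analysis reveals several factors correlated with the benefits of scaling.
Finally, we propose BioVFM, a large-scale medical vision foundation model pretrained on 21 million biomedical images, which outperforms previous state-of-the-art foundation models across 12 medical benchmarks.
Our results highlight that while scaling up is beneficial for pursuing better performance, task characteristics, data diversity, pretraining methods, and computational efficiency remain critical considerations for developing scalable medical foundation models.

Abstract (original)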

Scaling up model and data size have demonstrated impressive performance improvement over a wide range of tasks. Despite extensive studies on scaling behaviors for general-purpose tasks, medical images exhibit substantial differences from natural data. It remains unclear the key factors in developing medical vision foundation models at scale due to the absence of an extensive understanding of scaling behavior in the medical domain. In this paper, we explored the scaling behavior across model sizes, training algorithms, data sizes, and imaging modalities in developing scalable medical vision foundation models by self-supervised learning. To support scalable pretraining, we introduce BioVFM-21M, a large-scale biomedical image dataset encompassing a wide range of biomedical image modalities and anatomies. We observed that scaling up does provide benefits but varies across tasks. Additional analysis reveals several factors correlated with scaling benefits. Finally, we propose BioVFM, a large-scale medical vision foundation model pretrained on 21 million biomedical images, which outperforms the previous state-of-the-art foundation models across 12 medical benchmarks. Our results highlight that while scaling up is beneficial for pursuing better performance, task characteristics, data diversity, pretraining methods, and computational efficiency remain critical considerations for developing scalable medical foundation models.

arXiv information

Authors	Jiarun Liu, Hong-Yu Zhou, Weijian Huang, Hao Yang, Dongning Song, Tao Tan, Yong Liang, Shanshan Wang
Published	2025-05-14 12:25:41+00:00
arXiv page	arxiv_id(pdf)

Categories: cs.AI, cs.CV

DCSNet: A Lightweight Knowledge Distillation-Based Model with Explainable AI for Lung Cancer Diagnosis from Histopathological Images

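Posted: May 15, 2025 | Author: jarxiv

Summary

Lung cancer is a leading cause of cancer-related deaths worldwide, and early detection and accurate diagnosis are critical for improving survival rates.
Deep learning, particularly convolutional neural networks (CNNs), has revolutionized medical image analysis by detecting subtle patterns indicative of early-stage lung cancer, but its adoption faces challenges.
These models are often computationally expensive and require significant resources, which makes them unsuitable for resource-constrained environments.
In addition, their lack of transparency hinders trust and broader adoption in sensitive fields such as healthcare.
Knowledge distillation addresses these challenges by transferring knowledge from large, complex models (teachers) to smaller, lightweight models (students).
We propose a knowledge distillation-based approach for lung cancer detection that incorporates explainable AI (XAI) techniques to enhance model transparency.
Eight CNNs, including ResNet50, EfficientNetB0, EfficientNetB3, and VGG16, are evaluated as teacher models.
We developed and trained a lightweight student model, Distilled Custom Student Network (DCSNet), using ResNet50 as the teacher.
This approach not only ensures high diagnostic performance in resource-constrained settings but also addresses transparency concerns, facilitating the adoption of AI-driven diagnostic tools in healthcare.

Abstract (original)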

Lung cancer is a leading cause of cancer-related deaths globally, where early detection and accurate diagnosis are critical for improving survival rates. While deep learning, particularly convolutional neural networks (CNNs), has revolutionized medical image analysis by detecting subtle patterns indicative of early-stage lung cancer, its adoption faces challenges. These models are often computationally expensive and require significant resources, making them unsuitable for resource constrained environments. Additionally, their lack of transparency hinders trust and broader adoption in sensitive fields like healthcare. Knowledge distillation addresses these challenges by transferring knowledge from large, complex models (teachers) to smaller, lightweight models (students). We propose a knowledge distillation-based approach for lung cancer detection, incorporating explainable AI (XAI) techniques to enhance model transparency. Eight CNNs, including ResNet50, EfficientNetB0, EfficientNetB3, and VGG16, are evaluated as teacher models. We developed and trained a lightweight student model, Distilled Custom Student Network (DCSNet) using ResNet50 as the teacher. This approach not only ensures high diagnostic performance in resource-constrained settings but also addresses transparency concerns, facilitating the adoption of AI-driven diagnostic tools in healthcare.
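The teacher-to-student transfer described above is commonly trained with a loss that mixes temperature-softened teacher outputs with the true label. The sketch below shows that standard formulation in plain Python; the abstract does not state which loss DCSNet uses, so the function `distillation_loss` and its `temperature` and `alpha` defaults are assumptions for illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with optional temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target KL term and a hard-label cross-entropy.

    temperature and alpha are hypothetical defaults, not values from the paper.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions.
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    # Cross-entropy of the student against the ground-truth class index.
    ce = -math.log(softmax(student_logits)[label])
    # The T^2 factor keeps the soft-target term on a comparable scale.
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


# Hypothetical three-class logits; the true class index is 0.
teacher = [4.0, 1.0, -2.0]
student = [2.5, 0.5, -1.0]
print(distillation_loss(student, teacher, label=0))
```

A higher temperature flattens both distributions, so the student also learns from the teacher's relative confidence in the wrong classes rather than only from its top prediction.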

arXiv information

Authors	Sadman Sakib Alif, Nasim Anzum Promise, Fiaz Al Abid, Aniqua Nusrat Zereen
Published	2025-05-14 12:28:45+00:00
arXiv page	arxiv_id(pdf)

Categories: cs.CV, eess.IV

Unsupervised Multiview Contrastive Language-Image Joint Learning with Pseudo-Labeled Prompts Via Vision-Language Model for 3D/4D Facial Expression Recognition

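Posted: May 15, 2025 | Author: jarxiv

Summary

This paper introduces MultiviewVLM, a vision-language model designed for unsupervised contrastive multiview representation learning of facial emotions from 3D/4D data.
Our architecture integrates pseudo-labels derived from generated text prompts to guide the implicit alignment of emotional semantics.
To capture information shared across views, we propose a joint embedding space that aligns multiview representations without explicit supervision.
We further improve the model's discriminability with a new multiview contrastive learning strategy that uses stable positive-negative pair sampling.
A gradient-friendly loss function is introduced to promote smoother, more stable convergence, and the model is optimized for distributed training to ensure scalability.
Extensive experiments show that MultiviewVLM outperforms existing state-of-the-art methods and can be easily adapted to various real-world applications with minimal modifications.

Abstract (original)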

In this paper, we introduce MultiviewVLM, a vision-language model designed for unsupervised contrastive multiview representation learning of facial emotions from 3D/4D data. Our architecture integrates pseudo-labels derived from generated textual prompts to guide implicit alignment of emotional semantics. To capture shared information across multi-views, we propose a joint embedding space that aligns multiview representations without requiring explicit supervision. We further enhance the discriminability of our model through a novel multiview contrastive learning strategy that leverages stable positive-negative pair sampling. A gradient-friendly loss function is introduced to promote smoother and more stable convergence, and the model is optimized for distributed training to ensure scalability. Extensive experiments demonstrate that MultiviewVLM outperforms existing state-of-the-art methods and can be easily adapted to various real-world applications with minimal modifications.

arXiv information

Authors	Muzammil Behzad
Published	2025-05-14 12:31:21+00:00
arXiv page	arxiv_id(pdf)

Categories: cs.CV

GreenFactory: Ensembling Zero-Cost Proxies to Estimate Performance of Neural Networks

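Posted: May 15, 2025 | Author: jarxiv

Summary

Determining the performance of a deep neural network during neural architecture search is essential for identifying optimal architectures and hyperparameters.
Traditionally, this requires training and evaluating each network, which is time-consuming and resource-intensive.
Zero-cost proxies estimate performance without training and serve as an alternative to conventional training.
However, recent proxies often generalize poorly across diverse scenarios and provide only relative rankings rather than predicted accuracies.
To address these limitations, we propose GreenFactory, an ensemble of zero-cost proxies that uses a random forest regressor to combine the strengths of multiple predictors and directly predict model test accuracy.
We evaluate GreenFactory on NATS-Bench and achieve robust results across multiple datasets.
Specifically, GreenFactory achieves high Kendall correlations on NATS-Bench-SSS, indicating substantial agreement between its predicted scores and actual performance: 0.907 for CIFAR-10, 0.945 for CIFAR-100, and 0.920 for ImageNet-16-120.
Similarly, on NATS-Bench-TSS it achieves correlations of 0.921 for CIFAR-10, 0.929 for CIFAR-100, and 0.908 for ImageNet-16-120, demonstrating its reliability in both search spaces.

Abstract (original)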

Determining the performance of a Deep Neural Network during Neural Architecture Search processes is essential for identifying optimal architectures and hyperparameters. Traditionally, this process requires training and evaluation of each network, which is time-consuming and resource-intensive. Zero-cost proxies estimate performance without training, serving as an alternative to traditional training. However, recent proxies often lack generalization across diverse scenarios and provide only relative rankings rather than predicted accuracies. To address these limitations, we propose GreenFactory, an ensemble of zero-cost proxies that leverages a random forest regressor to combine multiple predictors’ strengths and directly predict model test accuracy. We evaluate GreenFactory on NATS-Bench, achieving robust results across multiple datasets. Specifically, GreenFactory achieves high Kendall correlations on NATS-Bench-SSS, indicating substantial agreement between its predicted scores and actual performance: 0.907 for CIFAR-10, 0.945 for CIFAR-100, and 0.920 for ImageNet-16-120. Similarly, on NATS-Bench-TSS, we achieve correlations of 0.921 for CIFAR-10, 0.929 for CIFAR-100, and 0.908 for ImageNet-16-120, showcasing its reliability in both search spaces.
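The Kendall correlations quoted above measure how well the predicted ordering of networks matches the true ordering. As a rough illustration, the sketch below computes the simple tau-a variant with no tie correction; which variant the authors report is not stated in the abstract, and the accuracy values used here are made up for the example.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both sequences order this pair the same way
        elif sign < 0:
            discordant += 1   # the sequences disagree on this pair
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical predicted vs. measured test accuracies for four networks.
predicted = [0.91, 0.85, 0.88, 0.70]
actual = [0.93, 0.86, 0.84, 0.72]
print(kendall_tau(predicted, actual))
```

Here five of the six pairs are ordered consistently and one is swapped, giving (5 - 1) / 6, or about 0.667. A value near 1 means the predictor ranks architectures almost exactly as real training would.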

arXiv information

Authors	Gabriel Cortês, Nuno Lourenço, Paolo Romano, Penousal Machado
Published	2025-05-14 12:40:34+00:00
arXiv page	arxiv_id(pdf)

Categories: cs.AI, cs.CV, cs.LG

3D Cartoon Face Generation with Controllable Expressions from a Single GAN Image

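Posted: May 15, 2025 | Author: jarxiv

Summary

This paper investigates the open research task of generating 3D cartoon face shapes from a single 2D GAN-generated human face without 3D supervision, while also allowing the facial expressions of the 3D shapes to be manipulated.
To this end, we uncover the semantic meanings of the StyleGAN latent space, so that controlling the latent codes produces face images with various expressions, poses, and lighting conditions.
Specifically, we first fine-tune a pretrained StyleGAN face model on cartoon datasets.
By feeding the same latent codes to the face and cartoon generation models, we aim to translate 2D human face images into cartoon-style avatars.
We then discover semantic directions in the GAN latent space in order to change facial expressions while preserving the original identity.
Because no 3D annotations exist for cartoon faces, we manipulate the latent codes to generate images with different poses and lighting conditions, from which the 3D cartoon face shapes can be reconstructed.
We validate the efficacy of our method on three cartoon datasets, both qualitatively and quantitatively.

Abstract (original)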

In this paper, we investigate an open research task of generating 3D cartoon face shapes from single 2D GAN generated human faces and without 3D supervision, where we can also manipulate the facial expressions of the 3D shapes. To this end, we discover the semantic meanings of StyleGAN latent space, such that we are able to produce face images of various expressions, poses, and lighting conditions by controlling the latent codes. Specifically, we first finetune the pretrained StyleGAN face model on the cartoon datasets. By feeding the same latent codes to face and cartoon generation models, we aim to realize the translation from 2D human face images to cartoon styled avatars. We then discover semantic directions of the GAN latent space, in an attempt to change the facial expressions while preserving the original identity. As we do not have any 3D annotations for cartoon faces, we manipulate the latent codes to generate images with different poses and lighting conditions, such that we can reconstruct the 3D cartoon face shapes. We validate the efficacy of our method on three cartoon datasets qualitatively and quantitatively.

arXiv information

Authors	Hao Wang, Wenhao Shen, Guosheng Lin, Steven C. H. Hoi, Chunyan Miao
Published	2025-05-14 12:40:37+00:00
arXiv page	arxiv_id(pdf)

Categories: cs.CV

PRISM: A Unified Framework for Photorealistic Reconstruction and Intrinsic Scene Modeling

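Posted: May 15, 2025 | Author: jarxiv

Summary

We present PRISM, a unified framework that enables multiple image generation and editing tasks in a single foundation model.
Starting from a pretrained text-to-image diffusion model, PRISM proposes an effective fine-tuning strategy that produces RGB images together with intrinsic maps (referred to as X layers) simultaneously.
Unlike previous approaches, which infer intrinsic properties individually or require separate models for decomposition and conditional generation, PRISM maintains consistency across modalities by generating all intrinsic layers jointly.
It supports diverse tasks, including text-to-RGBX generation, RGB-to-X decomposition, and X-to-RGBX conditional generation.
In addition, PRISM enables both global and local image editing through conditioning on selected intrinsic layers and text prompts.
Extensive experiments demonstrate PRISM's competitive performance in both intrinsic image decomposition and conditional image generation while preserving the base model's text-to-image generation capability.

Abstract (original)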

We present PRISM, a unified framework that enables multiple image generation and editing tasks in a single foundational model. Starting from a pre-trained text-to-image diffusion model, PRISM proposes an effective fine-tuning strategy to produce RGB images along with intrinsic maps (referred to as X layers) simultaneously. Unlike previous approaches, which infer intrinsic properties individually or require separate models for decomposition and conditional generation, PRISM maintains consistency across modalities by generating all intrinsic layers jointly. It supports diverse tasks, including text-to-RGBX generation, RGB-to-X decomposition, and X-to-RGBX conditional generation. Additionally, PRISM enables both global and local image editing through conditioning on selected intrinsic layers and text prompts. Extensive experiments demonstrate the competitive performance of PRISM both for intrinsic image decomposition and conditional image generation while preserving the base model’s text-to-image generation capability.

arXiv information

Authors	Alara Dirik, Tuanfeng Wang, Duygu Ceylan, Stefanos Zafeiriou, Anna Frühstück
Published	2025-05-14 12:50:18+00:00
arXiv page	arxiv_id(pdf)

Categories: cs.CV, cs.GR

MCP-MedSAM: A Powerful Lightweight Medical Segment Anything Model Trained with a Single GPU in Just One Day

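Posted: May 15, 2025 | Author: jarxiv

Summary

Medical image segmentation involves partitioning medical images into meaningful regions, with a focus on identifying anatomical structures and lesions.
It has broad applications in healthcare, and deep learning methods have enabled significant progress in automating this process.
Recently, the introduction of the Segment Anything Model (SAM), the first foundation model for segmentation tasks, has prompted researchers to adapt it to the medical domain to improve performance across various tasks.
However, SAM's large model size and high GPU requirements hinder its scalability and development in the medical domain.
In this work, we propose MCP-MedSAM, a powerful and lightweight medical SAM model designed to be trainable on a single A100 GPU with 40GB of memory within one day while delivering superior segmentation performance.
Recognizing the significant internal differences between modalities and the need for direct segmentation target information within bounding boxes, we introduce two kinds of prompts: the modality prompt and the content prompt.
After passing through the prompt encoder, their embedding representations can further improve segmentation performance by incorporating more relevant information without adding significant training overhead.
In addition, we adopt an effective modality-based data sampling strategy to address data imbalance between modalities, ensuring more balanced performance across all modalities.
Our method was trained and evaluated on a large-scale challenge dataset. Compared with the top-ranking methods on the challenge leaderboard, MCP-MedSAM achieved superior performance while requiring only one day of training on a single GPU.
The code is publicly available at https://github.com/dong845/MCP-MedSAM.

Abstract (original)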

Medical image segmentation involves partitioning medical images into meaningful regions, with a focus on identifying anatomical structures and lesions. It has broad applications in healthcare, and deep learning methods have enabled significant advancements in automating this process. Recently, the introduction of the Segmentation Anything Model (SAM), the first foundation model for segmentation task, has prompted researchers to adapt it for the medical domain to improve performance across various tasks. However, SAM’s large model size and high GPU requirements hinder its scalability and development in the medical domain. In this work, we propose MCP-MedSAM, a powerful and lightweight medical SAM model designed to be trainable on a single A100 GPU with 40GB of memory within one day while delivering superior segmentation performance. Recognizing the significant internal differences between modalities and the need for direct segmentation target information within bounding boxes, we introduce two kinds of prompts: the modality prompt and the content prompt. After passing through the prompt encoder, their embedding representations can further improve the segmentation performance by incorporating more relevant information without adding significant training overhead. Additionally, we adopt an effective modality-based data sampling strategy to address data imbalance between modalities, ensuring more balanced performance across all modalities. Our method was trained and evaluated using a large-scale challenge dataset, compared to top-ranking methods on the challenge leaderboard, MCP-MedSAM achieved superior performance while requiring only one day of training on a single GPU. The code is publicly available at https://github.com/dong845/MCP-MedSAM.
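One simple way to counter data imbalance between modalities is to pick a modality uniformly at random first and only then pick an image within it. The sketch below shows that scheme as an assumption; the abstract does not describe the exact sampling rule MCP-MedSAM uses, and the function `balanced_sampler` and the toy dataset are hypothetical.

```python
import random
from collections import Counter


def balanced_sampler(dataset_by_modality, n_samples, seed=0):
    """Yield (modality, item) pairs with each modality equally likely.

    Assumed scheme: uniform over modalities, then uniform within a modality.
    """
    rng = random.Random(seed)
    modalities = sorted(dataset_by_modality)
    for _ in range(n_samples):
        modality = rng.choice(modalities)
        yield modality, rng.choice(dataset_by_modality[modality])


# Hypothetical, heavily imbalanced toy dataset.
toy = {
    "CT": [f"ct_{i}" for i in range(1000)],
    "MRI": [f"mri_{i}" for i in range(100)],
    "Ultrasound": [f"us_{i}" for i in range(10)],
}

counts = Counter(modality for modality, _ in balanced_sampler(toy, 3000))
print(counts)
```

Even though CT has 100 times more images than Ultrasound in this toy data, each modality is drawn about equally often. The trade-off is that images from small modalities are repeated far more frequently.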

arXiv information

Authors	Donghang Lyu, Ruochen Gao, Marius Staring
Published	2025-05-14 12:51:49+00:00
arXiv page	arxiv_id(pdf)

Categories: cs.CV

Neural Brain: A Neuroscience-inspired Framework for Embodied Agents

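Posted: May 15, 2025 | Author: jarxiv

Summary

The rapid evolution of artificial intelligence (AI) has shifted from static, data-driven models to dynamic systems capable of perceiving and interacting with real-world environments.
Despite advances in pattern recognition and symbolic reasoning, current AI systems such as large language models remain disembodied and cannot physically engage with the world.
This limitation has driven the rise of embodied AI, in which autonomous agents such as humanoid robots must navigate and manipulate unstructured environments with human-like adaptability.
At the core of this challenge lies the concept of the Neural Brain, a central intelligence system designed to drive embodied agents with human-like adaptability.
A Neural Brain must seamlessly integrate multimodal sensing and perception with cognitive capabilities.
Achieving this also requires an adaptive memory system and energy-efficient hardware-software co-design, enabling real-time action in dynamic environments.
This paper introduces a unified framework for the Neural Brain of embodied agents and addresses two fundamental challenges: (1) defining the core components of the Neural Brain and (2) bridging the gap between static AI models and the dynamic adaptability required for real-world deployment.
To this end, we propose a biologically inspired architecture that integrates multimodal active sensing, perception-cognition-action function, neuroplasticity-based memory storage and updating, and neuromorphic hardware/software optimization.
We also review the latest research on embodied agents across these four aspects and analyze the gap between current AI systems and human intelligence.
By synthesizing insights from neuroscience, we outline a roadmap toward generalizable, autonomous agents capable of human-level intelligence in real-world scenarios.

Abstract (original)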

The rapid evolution of artificial intelligence (AI) has shifted from static, data-driven models to dynamic systems capable of perceiving and interacting with real-world environments. Despite advancements in pattern recognition and symbolic reasoning, current AI systems, such as large language models, remain disembodied, unable to physically engage with the world. This limitation has driven the rise of embodied AI, where autonomous agents, such as humanoid robots, must navigate and manipulate unstructured environments with human-like adaptability. At the core of this challenge lies the concept of Neural Brain, a central intelligence system designed to drive embodied agents with human-like adaptability. A Neural Brain must seamlessly integrate multimodal sensing and perception with cognitive capabilities. Achieving this also requires an adaptive memory system and energy-efficient hardware-software co-design, enabling real-time action in dynamic environments. This paper introduces a unified framework for the Neural Brain of embodied agents, addressing two fundamental challenges: (1) defining the core components of Neural Brain and (2) bridging the gap between static AI models and the dynamic adaptability required for real-world deployment. To this end, we propose a biologically inspired architecture that integrates multimodal active sensing, perception-cognition-action function, neuroplasticity-based memory storage and updating, and neuromorphic hardware/software optimization. Furthermore, we also review the latest research on embodied agents across these four aspects and analyze the gap between current AI systems and human intelligence. By synthesizing insights from neuroscience, we outline a roadmap towards the development of generalizable, autonomous agents capable of human-level intelligence in real-world scenarios.

arXiv information

Authors	Jian Liu, Xiongtao Shi, Thai Duy Nguyen, Haitian Zhang, Tianxiang Zhang, Wei Sun, Yanjie Li, Athanasios V. Vasilakos, Giovanni Iacca, Arshad Ali Khan, Arvind Kumar, Jae Won Cho, Ajmal Mian, Lihua Xie, Erik Cambria, Lin Wang
Published	2025-05-14 12:56:45+00:00
arXiv page	arxiv_id(pdf)

Categories: cs.AI, cs.CV, cs.RO

A Deep Learning Approach for Pixel-level Material Classification via Hyperspectral Imaging

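Posted: May 15, 2025 | Author: jarxiv

Summary

Recent advances in computer vision, particularly in detection, segmentation, and classification, have had a significant impact on various domains.
However, these advances are tied to RGB-based systems, which are insufficient in industries such as waste sorting, pharmaceuticals, and defense, where advanced object characterization beyond shape or color is required.
Hyperspectral (HS) imaging, which captures both spectral and spatial information, addresses these limitations and offers advantages over conventional technologies such as X-ray fluorescence and Raman spectroscopy, particularly in terms of speed, cost, and safety.
This study evaluates the potential of combining HS imaging with deep learning for material characterization.
The research involves: i) designing an experimental setup with an HS camera, a conveyor, and controlled lighting;
ii) generating a multi-object dataset of various plastics (HDPE, PET, PP, PS) with semi-automated mask generation and Raman spectroscopy-based labeling;
and iii) developing a deep learning model trained on HS images for pixel-level material classification.
The model achieved 99.94% classification accuracy, demonstrating robustness to variations in color, size, and shape, and handled material overlap effectively.
Limitations, such as challenges with black objects, are also discussed.
Extending computer vision beyond RGB to HS imaging proves feasible, overcoming major limitations of traditional methods and showing strong potential for future applications.

Abstract (original)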

Recent advancements in computer vision, particularly in detection, segmentation, and classification, have significantly impacted various domains. However, these advancements are tied to RGB-based systems, which are insufficient for applications in industries like waste sorting, pharmaceuticals, and defense, where advanced object characterization beyond shape or color is necessary. Hyperspectral (HS) imaging, capturing both spectral and spatial information, addresses these limitations and offers advantages over conventional technologies such as X-ray fluorescence and Raman spectroscopy, particularly in terms of speed, cost, and safety. This study evaluates the potential of combining HS imaging with deep learning for material characterization. The research involves: i) designing an experimental setup with HS camera, conveyor, and controlled lighting; ii) generating a multi-object dataset of various plastics (HDPE, PET, PP, PS) with semi-automated mask generation and Raman spectroscopy-based labeling; and iii) developing a deep learning model trained on HS images for pixel-level material classification. The model achieved 99.94% classification accuracy, demonstrating robustness in color, size, and shape invariance, and effectively handling material overlap. Limitations, such as challenges with black objects, are also discussed. Extending computer vision beyond RGB to HS imaging proves feasible, overcoming major limitations of traditional methods and showing strong potential for future applications.

arXiv information

Authors	Savvas Sifnaios, George Arvanitakis, Fotios K. Konstantinidis, Georgios Tsimiklis, Angelos Amditis, Panayiotis Frangos
Published	2025-05-14 13:01:39+00:00
arXiv page	arxiv_id(pdf)

Categories: cs.AI, cs.CV, eess.IV, I.2.10
