jarxiv | Japanese arxiv | Page 1218

Change3D: Revisiting Change Detection and Captioning from A Video Modeling Perspective

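Posted: March 25, 2025 | Author: jarxiv

Summary

This paper presents Change3D, a framework that reconceptualizes change detection and captioning tasks through video modeling.
Recent methods have achieved remarkable success by treating each pair of bi-temporal images as separate frames.
They use a shared-weight image encoder to extract spatial features and then a change extractor to capture the differences between the two images.
However, image feature encoding is a task-agnostic process and cannot attend to changed regions effectively.
In addition, the different change extractors designed for the various change detection and captioning tasks make a unified framework difficult to achieve.
To tackle these challenges, Change3D treats the bi-temporal images as two frames of a tiny video.
By integrating learnable perception frames between the bi-temporal images, a video encoder lets the perception frames interact with the images directly and perceive their differences.
This removes the need for intricate change extractors and yields a unified framework for different change detection and captioning tasks.
Change3D is verified on multiple tasks across eight standard benchmarks: change detection (binary change detection, semantic change detection, and building damage assessment) and change captioning.
Without bells and whistles, this simple yet effective framework achieves superior performance with an ultra-light video model that has only ~6%-13% of the parameters and ~8%-34% of the FLOPs of state-of-the-art methods.
The authors hope that Change3D can serve as an alternative to 2D-based models and facilitate future research.

Summary (original)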

In this paper, we present Change3D, a framework that reconceptualizes the change detection and captioning tasks through video modeling. Recent methods have achieved remarkable success by regarding each pair of bi-temporal images as separate frames. They employ a shared-weight image encoder to extract spatial features and then use a change extractor to capture differences between the two images. However, image feature encoding, being a task-agnostic process, cannot attend to changed regions effectively. Furthermore, different change extractors designed for various change detection and captioning tasks make it difficult to have a unified framework. To tackle these challenges, Change3D regards the bi-temporal images as comprising two frames akin to a tiny video. By integrating learnable perception frames between the bi-temporal images, a video encoder enables the perception frames to interact with the images directly and perceive their differences. Therefore, we can get rid of the intricate change extractors, providing a unified framework for different change detection and captioning tasks. We verify Change3D on multiple tasks, encompassing change detection (including binary change detection, semantic change detection, and building damage assessment) and change captioning, across eight standard benchmarks. Without bells and whistles, this simple yet effective framework can achieve superior performance with an ultra-light video model comprising only ~6%-13% of the parameters and ~8%-34% of the FLOPs compared to state-of-the-art methods. We hope that Change3D could be an alternative to 2D-based models and facilitate future research.
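The input layout described in the abstract (learnable perception frames placed between the two bi-temporal images) can be sketched as follows. This is only a toy illustration under stated assumptions: frames are plain nested lists, the perception frame is initialised to zeros as a stand-in for trainable parameters, and the names `make_perception_frame`, `build_tiny_video`, and `perception_indices` are hypothetical, not taken from the paper. The video encoder itself is not shown.

```python
from typing import List

Frame = List[List[float]]  # one H x W single-channel frame (toy stand-in for an image)


def make_perception_frame(height: int, width: int) -> Frame:
    """Hypothetical learnable frame; zeros stand in for trainable parameters."""
    return [[0.0] * width for _ in range(height)]


def build_tiny_video(img_before: Frame, img_after: Frame,
                     num_perception: int = 1) -> List[Frame]:
    """Place the perception frame(s) between the two bi-temporal images."""
    height, width = len(img_before), len(img_before[0])
    if len(img_after) != height or any(len(row) != width for row in img_after):
        raise ValueError("bi-temporal images must share the same size")
    perception = [make_perception_frame(height, width) for _ in range(num_perception)]
    return [img_before, *perception, img_after]


def perception_indices(num_perception: int = 1) -> range:
    """Temporal positions whose encoded features a task head would read (assumption)."""
    return range(1, 1 + num_perception)


before = [[0.1, 0.2], [0.3, 0.4]]
after = [[0.1, 0.9], [0.3, 0.4]]
video = build_tiny_video(before, after)
print(len(video), list(perception_indices()))
```

In such a layout, a task-specific head would consume only the encoded features at the perception-frame positions, which is one way to read the abstract's claim that a separate change extractor is no longer needed.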

arXiv information

Authors	Duowang Zhu, Xiaohu Huang, Haiyan Huang, Hao Zhou, Zhenfeng Shao
Published	2025-03-24 15:48:07+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google

Category: cs.CV

Learned, uncertainty-driven adaptive acquisition for photon-efficient scanning microscopy

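Posted: March 25, 2025 | Author: jarxiv

Summary

Scanning microscopy systems, such as confocal and multiphoton microscopes, are powerful imaging tools for probing deep into biological tissue.
However, scanning systems have an inherent trade-off between acquisition time, field of view, phototoxicity, and image quality, which often results in noisy measurements when fast, large-field-of-view, and/or gentle imaging is needed.
Deep learning can be used to denoise noisy microscopy measurements, but these algorithms are prone to hallucination, which can be disastrous for medical and scientific applications.
The authors propose a method that simultaneously denoises and predicts pixel-wise uncertainty for scanning microscopy systems, improving algorithm trustworthiness and providing statistical guarantees for deep learning predictions.
They further propose using this learned pixel-wise uncertainty to drive an adaptive acquisition technique that rescans only the most uncertain regions of a sample, saving time and reducing the total light dose to the sample.
The method is demonstrated on experimental confocal and multiphoton microscopy systems, where the uncertainty maps pinpoint hallucinations in the deep-learned predictions.
Finally, the adaptive acquisition technique achieves up to a 16X reduction in acquisition time and total light dose while successfully recovering fine features in the sample and reducing hallucinations.
The authors state that they are the first to demonstrate distribution-free uncertainty quantification for a denoising task with real experimental data and the first to propose adaptive acquisition based on reconstruction uncertainty.

Summary (original)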

Scanning microscopy systems, such as confocal and multiphoton microscopy, are powerful imaging tools for probing deep into biological tissue. However, scanning systems have an inherent trade-off between acquisition time, field of view, phototoxicity, and image quality, often resulting in noisy measurements when fast, large field of view, and/or gentle imaging is needed. Deep learning could be used to denoise noisy microscopy measurements, but these algorithms can be prone to hallucination, which can be disastrous for medical and scientific applications. We propose a method to simultaneously denoise and predict pixel-wise uncertainty for scanning microscopy systems, improving algorithm trustworthiness and providing statistical guarantees for deep learning predictions. Furthermore, we propose to leverage this learned, pixel-wise uncertainty to drive an adaptive acquisition technique that rescans only the most uncertain regions of a sample, saving time and reducing the total light dose to the sample. We demonstrate our method on experimental confocal and multiphoton microscopy systems, showing that our uncertainty maps can pinpoint hallucinations in the deep learned predictions. Finally, with our adaptive acquisition technique, we demonstrate up to 16X reduction in acquisition time and total light dose while successfully recovering fine features in the sample and reducing hallucinations. We are the first to demonstrate distribution-free uncertainty quantification for a denoising task with real experimental data and the first to propose adaptive acquisition based on reconstruction uncertainty.
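The abstract does not spell out how the distribution-free guarantee or the rescanning rule is computed. As a rough sketch, assuming a split-conformal calibration of heuristic per-pixel uncertainties followed by a fixed rescan budget, the two steps might look like this. The function names, the `alpha` level, and the `fraction` budget are all assumptions introduced for illustration.

```python
import math
from typing import List, Sequence, Tuple


def conformal_scale(preds: Sequence[float], sigmas: Sequence[float],
                    truths: Sequence[float], alpha: float = 0.1) -> float:
    """Split-conformal factor q: |truth - pred| <= q * sigma should hold for
    about a (1 - alpha) fraction of exchangeable future pixels."""
    scores = sorted(abs(t - p) / s for p, s, t in zip(preds, sigmas, truths))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if rank > n:
        return math.inf  # too few calibration pixels for this alpha
    return scores[rank - 1]


def intervals(preds: Sequence[float], sigmas: Sequence[float],
              q: float) -> List[Tuple[float, float]]:
    """Calibrated per-pixel intervals around the denoised prediction."""
    return [(p - q * s, p + q * s) for p, s in zip(preds, sigmas)]


def pixels_to_rescan(sigmas: Sequence[float], q: float, fraction: float) -> List[int]:
    """Indices of the widest-interval pixels, limited to `fraction` of the image."""
    budget = max(1, int(len(sigmas) * fraction))
    order = sorted(range(len(sigmas)), key=lambda i: q * sigmas[i], reverse=True)
    return sorted(order[:budget])


# Toy calibration set: nine pixels with known ground truth.
cal_truths = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
q = conformal_scale([0.0] * 9, [1.0] * 9, cal_truths, alpha=0.2)
print(q, pixels_to_rescan([0.1, 0.5, 0.2, 0.9], q, fraction=0.5))
```

Rescanning only the selected indices is what saves acquisition time and light dose; how the budget relates to the reported 16X reduction is not stated in the abstract and is not modelled here.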

arXiv information

Authors	Cassandra Tong Ye, Jiashu Han, Kunzan Liu, Anastasios Angelopoulos, Linda Griffith, Kristina Monakhova, Sixian You
Published	2025-03-24 15:48:32+00:00
arXiv site	arxiv_id(pdf)

Category: cs.CV, eess.IV, physics.optics

CRCL: Causal Representation Consistency Learning for Anomaly Detection in Surveillance Videos

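Posted: March 25, 2025 | Author: jarxiv

Summary

Video anomaly detection (VAD) remains a fundamental yet formidable task in the video understanding community, with promising applications in areas such as information forensics and public safety protection.
Because anomalies are rare and diverse, existing methods use only easily collected regular events to model the inherent normality of normal spatial-temporal patterns in an unsupervised manner.
Previous studies have shown that existing unsupervised VAD models cannot handle label-independent data offsets (e.g., scene changes) in real-world scenarios and may fail to respond to light anomalies because deep neural networks overgeneralize.
Inspired by causality learning, the authors argue that there exist causal factors that adequately generalize the prototypical patterns of regular events and show significant deviations when anomalous instances occur.
To this end, they propose Causal Representation Consistency Learning (CRCL) to implicitly mine potential scene-robust causal variables in unsupervised video normality learning.
Specifically, building on structural causal models, they propose scene-debiasing learning to strip away the scene bias entangled in deep representations and causality-inspired normality learning to learn causal video normality.
Extensive experiments on benchmarks validate the superiority of the method over conventional deep representation learning.
Moreover, ablation studies and extension validation show that CRCL can cope with label-independent biases in multi-scene settings and maintain stable performance with only limited training data available.

Summary (original)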

Video Anomaly Detection (VAD) remains a fundamental yet formidable task in the video understanding community, with promising applications in areas such as information forensics and public safety protection. Due to the rarity and diversity of anomalies, existing methods only use easily collected regular events to model the inherent normality of normal spatial-temporal patterns in an unsupervised manner. Previous studies have shown that existing unsupervised VAD models are incapable of label-independent data offsets (e.g., scene changes) in real-world scenarios and may fail to respond to light anomalies due to the overgeneralization of deep neural networks. Inspired by causality learning, we argue that there exist causal factors that can adequately generalize the prototypical patterns of regular events and present significant deviations when anomalous instances occur. In this regard, we propose Causal Representation Consistency Learning (CRCL) to implicitly mine potential scene-robust causal variable in unsupervised video normality learning. Specifically, building on the structural causal models, we propose scene-debiasing learning and causality-inspired normality learning to strip away entangled scene bias in deep representations and learn causal video normality, respectively. Extensive experiments on benchmarks validate the superiority of our method over conventional deep representation learning. Moreover, ablation studies and extension validation show that the CRCL can cope with label-independent biases in multi-scene settings and maintain stable performance with only limited training data available.

arXiv information

Authors	Yang Liu, Hongjin Wang, Zepu Wang, Xiaoguang Zhu, Jing Liu, Peng Sun, Rui Tang, Jianwei Du, Victor C. M. Leung, Liang Song
Published	2025-03-24 15:50:19+00:00
arXiv site	arxiv_id(pdf)

Category: cs.CV

SKDU at De-Factify 4.0: Vision Transformer with Data Augmentation for AI-Generated Image Detection

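Posted: March 25, 2025 | Author: jarxiv

Summary

This work explores the potential of pre-trained vision-language models, e.g. Vision Transformers (ViT), enhanced with advanced data augmentation strategies for detecting AI-generated images.
The approach leverages a fine-tuned ViT model trained on the Defactify-4.0 dataset, which includes images generated by state-of-the-art models such as Stable Diffusion 2.1, Stable Diffusion XL, Stable Diffusion 3, DALL-E 3, and MidJourney.
To improve model robustness and generalisation, perturbation techniques such as flipping, rotation, Gaussian noise injection, and JPEG compression are applied during training.
The experimental results show that the ViT-based pipeline achieves state-of-the-art performance, significantly outperforming competing methods on both the validation and test datasets.

Summary (original)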

The aim of this work is to explore the potential of pre-trained vision-language models, e.g. Vision Transformers (ViT), enhanced with advanced data augmentation strategies for the detection of AI-generated images. Our approach leverages a fine-tuned ViT model trained on the Defactify-4.0 dataset, which includes images generated by state-of-the-art models such as Stable Diffusion 2.1, Stable Diffusion XL, Stable Diffusion 3, DALL-E 3, and MidJourney. We employ perturbation techniques like flipping, rotation, Gaussian noise injection, and JPEG compression during training to improve model robustness and generalisation. The experimental results demonstrate that our ViT-based pipeline achieves state-of-the-art performance, significantly outperforming competing methods on both validation and test datasets.
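Three of the four perturbations named in the abstract (flipping, rotation, Gaussian noise injection) can be sketched with the standard library alone. The probabilities, noise level, and function names below are assumptions for illustration, not the authors' settings. JPEG compression is left out because it needs an image codec that the standard library does not provide.

```python
import random
from typing import List

Image = List[List[float]]  # H x W grayscale image with values in [0, 1]


def hflip(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def rotate90(img: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def add_gaussian_noise(img: Image, sigma: float, rng: random.Random) -> Image:
    """Add zero-mean Gaussian noise and clip back to [0, 1]."""
    return [[min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in img]


def augment(img: Image, rng: random.Random, p: float = 0.5,
            sigma: float = 0.05) -> Image:
    """Apply each perturbation independently with probability p (assumed policy)."""
    if rng.random() < p:
        img = hflip(img)
    if rng.random() < p:
        img = rotate90(img)
    if rng.random() < p:
        img = add_gaussian_noise(img, sigma, rng)
    return img


rng = random.Random(0)
sample = [[0.0, 0.25], [0.5, 1.0]]
augmented = augment(sample, rng)
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which makes it easier to compare training runs.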

arXiv information

Authors	Shrikant Malviya, Neelanjan Bhowmik, Stamos Katsigiannis
Published	2025-03-24 15:53:54+00:00
arXiv site	arxiv_id(pdf)

Category: cs.CV

Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations

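Posted: March 25, 2025 | Author: jarxiv

Summary

Prior research on out-of-distribution detection (OoDD) has primarily focused on single-modality models.
Recently, with the advent of large-scale pretrained vision-language models such as CLIP, OoDD methods that use such multi-modal representations through zero-shot and prompt learning strategies have emerged.
However, these methods typically either freeze the pretrained weights or tune them only partially, which can be suboptimal for downstream datasets.
This paper highlights that multi-modal fine-tuning (MMFT) can achieve notable OoDD performance.
Although some recent works demonstrate the impact of fine-tuning methods on OoDD, significant potential for performance improvement remains.
The authors investigate the limitations of naive fine-tuning methods and examine why they fail to fully leverage the pretrained knowledge.
Their empirical analysis suggests that the issue could stem from the modality gap within in-distribution (ID) embeddings.
To address this, they propose a training objective that enhances cross-modal alignment by regularizing the distances between image and text embeddings of ID data.
This adjustment helps make better use of pretrained textual information by aligning similar semantics from different modalities (i.e., text and image) more closely in the hyperspherical representation space.
They theoretically demonstrate that the proposed regularization corresponds to maximum likelihood estimation of an energy-based model on a hypersphere.
On ImageNet-1k OoD benchmark datasets, the method, combined with post-hoc OoDD approaches that leverage pretrained knowledge (e.g., NegLabel), significantly outperforms existing methods, achieving state-of-the-art OoDD performance and leading ID accuracy.

Summary (original)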

Prior research on out-of-distribution detection (OoDD) has primarily focused on single-modality models. Recently, with the advent of large-scale pretrained vision-language models such as CLIP, OoDD methods utilizing such multi-modal representations through zero-shot and prompt learning strategies have emerged. However, these methods typically involve either freezing the pretrained weights or only partially tuning them, which can be suboptimal for downstream datasets. In this paper, we highlight that multi-modal fine-tuning (MMFT) can achieve notable OoDD performance. Despite some recent works demonstrating the impact of fine-tuning methods for OoDD, there remains significant potential for performance improvement. We investigate the limitation of naive fine-tuning methods, examining why they fail to fully leverage the pretrained knowledge. Our empirical analysis suggests that this issue could stem from the modality gap within in-distribution (ID) embeddings. To address this, we propose a training objective that enhances cross-modal alignment by regularizing the distances between image and text embeddings of ID data. This adjustment helps in better utilizing pretrained textual information by aligning similar semantics from different modalities (i.e., text and image) more closely in the hyperspherical representation space. We theoretically demonstrate that the proposed regularization corresponds to the maximum likelihood estimation of an energy-based model on a hypersphere. Utilizing ImageNet-1k OoD benchmark datasets, we show that our method, combined with post-hoc OoDD approaches leveraging pretrained knowledge (e.g., NegLabel), significantly outperforms existing methods, achieving state-of-the-art OoDD performance and leading ID accuracy.
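The abstract describes the regularizer only as a penalty on distances between image and text embeddings of ID data on a hypersphere. One simple form such a penalty could take, assumed here purely for illustration, is the mean cosine distance between each L2-normalized image embedding and the text embedding of its own class. The function names and the exact distance are not taken from the paper.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity, computed on the normalized vectors."""
    ua, ub = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(ua, ub))


def alignment_regularizer(image_embs: Sequence[Sequence[float]],
                          text_embs: Sequence[Sequence[float]]) -> float:
    """Mean distance between each ID image and its matching text embedding."""
    pairs = list(zip(image_embs, text_embs))
    if not pairs:
        raise ValueError("need at least one image-text pair")
    return sum(cosine_distance(i, t) for i, t in pairs) / len(pairs)


images = [[1.0, 0.0], [0.0, 2.0]]
texts = [[2.0, 0.0], [3.0, 0.0]]  # first pair aligned, second pair orthogonal
print(alignment_regularizer(images, texts))
```

During fine-tuning, a term like this would be added to the task loss with a weighting coefficient (a hypothetical hyperparameter), so that smaller values correspond to a narrower modality gap.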

arXiv information

Authors	Jeonghyeon Kim, Sangheum Hwang
Published	2025-03-24 16:00:21+00:00
arXiv site	arxiv_id(pdf)

Category: cs.AI, cs.CV

Generative Omnimatte: Learning to Decompose Video into Layers

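Posted: March 25, 2025 | Author: jarxiv

Summary

Given a video and a set of input object masks, an omnimatte method aims to decompose the video into semantically meaningful layers containing individual objects along with their associated effects, such as shadows and reflections.
Existing omnimatte methods assume a static background or accurate pose and depth estimation, and they produce poor decompositions when these assumptions are violated.
Furthermore, because they lack a generative prior on natural videos, existing methods cannot complete dynamic occluded regions.
The authors present a novel generative layered video decomposition framework to address the omnimatte problem.
The method does not assume a stationary scene or require camera pose or depth information, and it produces clean, complete layers, including convincing completions of occluded dynamic regions.
The core idea is to train a video diffusion model to identify and remove scene effects caused by a specific object.
They show that this model can be finetuned from an existing video inpainting model with a small, carefully curated dataset, and they demonstrate high-quality decompositions and editing results for a wide range of casually captured videos containing soft shadows, glossy reflections, splashing water, and more.

Summary (original)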

Given a video and a set of input object masks, an omnimatte method aims to decompose the video into semantically meaningful layers containing individual objects along with their associated effects, such as shadows and reflections. Existing omnimatte methods assume a static background or accurate pose and depth estimation and produce poor decompositions when these assumptions are violated. Furthermore, due to the lack of generative prior on natural videos, existing methods cannot complete dynamic occluded regions. We present a novel generative layered video decomposition framework to address the omnimatte problem. Our method does not assume a stationary scene or require camera pose or depth information and produces clean, complete layers, including convincing completions of occluded dynamic regions. Our core idea is to train a video diffusion model to identify and remove scene effects caused by a specific object. We show that this model can be finetuned from an existing video inpainting model with a small, carefully curated dataset, and demonstrate high-quality decompositions and editing results for a wide range of casually captured videos containing soft shadows, glossy reflections, splashing water, and more.
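A layered decomposition is useful for editing because the layers can be recombined. The abstract does not state the compositing model, so the sketch below assumes the standard back-to-front alpha "over" operator on a single pixel. The `Layer` type and `composite_over` name are illustrative only.

```python
from typing import List, Tuple

Layer = Tuple[float, float]  # (color, alpha) for one pixel, alpha in [0, 1]


def composite_over(background: float, layers: List[Layer]) -> float:
    """Blend layers back to front over an opaque background pixel."""
    out = background
    for color, alpha in layers:
        if not 0.0 <= alpha <= 1.0:
            raise ValueError("alpha must lie in [0, 1]")
        out = alpha * color + (1.0 - alpha) * out
    return out


# A half-transparent dark layer (for example a soft shadow) over a bright background.
shadowed = composite_over(0.8, [(0.0, 0.5)])
# Dropping the layer recovers the clean background, which is the editing use case.
clean = composite_over(0.8, [])
print(shadowed, clean)
```

Under this assumption, removing an object together with its shadow or reflection amounts to leaving its layer out of the list before compositing.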

arXiv information

Authors	Yao-Chih Lee, Erika Lu, Sarah Rumbley, Michal Geyer, Jia-Bin Huang, Tali Dekel, Forrester Cole
Published	2025-03-24 16:08:09+00:00
arXiv site	arxiv_id(pdf)

Category: cs.CV

DAGait: Generalized Skeleton-Guided Data Alignment for Gait Recognition

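Posted: March 25, 2025 | Author: jarxiv

Summary

Gait recognition is emerging as a promising and innovative area within computer vision and is widely applied to remote person identification.
Although existing gait recognition methods have achieved substantial success on controlled laboratory datasets, their performance often declines significantly on in-the-wild datasets. The authors argue that the performance gap can be primarily attributed to the spatio-temporal distribution inconsistencies present in wild datasets, where subjects appear at varying angles, positions, and distances across the frames.
To achieve accurate gait recognition in the wild, they propose a skeleton-guided silhouette alignment strategy, which uses prior knowledge of the skeletons to perform affine transformations on the corresponding silhouettes. To the best of their knowledge, this is the first study to explore the impact of data alignment on gait recognition.
Extensive experiments across multiple datasets and network architectures demonstrate the significant advantages of the proposed alignment strategy. Specifically, on the challenging Gait3D dataset, the method achieves an average performance improvement of 7.9% across all evaluated networks.
Furthermore, the method achieves substantial improvements on cross-domain datasets, with accuracy improvements of up to 24.0%.

Summary (original)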

Gait recognition is emerging as a promising and innovative area within the field of computer vision, widely applied to remote person identification. Although existing gait recognition methods have achieved substantial success in controlled laboratory datasets, their performance often declines significantly when transitioning to wild datasets. We argue that the performance gap can be primarily attributed to the spatio-temporal distribution inconsistencies present in wild datasets, where subjects appear at varying angles, positions, and distances across the frames. To achieve accurate gait recognition in the wild, we propose a skeleton-guided silhouette alignment strategy, which uses prior knowledge of the skeletons to perform affine transformations on the corresponding silhouettes. To the best of our knowledge, this is the first study to explore the impact of data alignment on gait recognition. We conducted extensive experiments across multiple datasets and network architectures, and the results demonstrate the significant advantages of our proposed alignment strategy. Specifically, on the challenging Gait3D dataset, our method achieved an average performance improvement of 7.9% across all evaluated networks. Furthermore, our method achieves substantial improvements on cross-domain datasets, with accuracy improvements of up to 24.0%.
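The abstract says only that skeleton priors drive affine transformations of the silhouettes. A minimal sketch of that idea, assuming a restricted affine map (uniform scale plus translation) estimated from two hypothetical keypoints, is shown below. The choice of keypoints (head and ankle midpoint), the canonical height, and the function names are assumptions, not the authors' procedure.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


def alignment_from_skeleton(head: Point, ankle_mid: Point, target_height: float,
                            target_center: Point) -> Tuple[float, float, float]:
    """Return (scale, tx, ty) that maps the body to a canonical size and position."""
    dx, dy = head[0] - ankle_mid[0], head[1] - ankle_mid[1]
    body_height = (dx * dx + dy * dy) ** 0.5
    if body_height == 0.0:
        raise ValueError("degenerate skeleton: keypoints coincide")
    scale = target_height / body_height
    cx = (head[0] + ankle_mid[0]) / 2.0
    cy = (head[1] + ankle_mid[1]) / 2.0
    return scale, target_center[0] - scale * cx, target_center[1] - scale * cy


def apply_affine(points: List[Point], scale: float, tx: float, ty: float) -> List[Point]:
    """Apply x' = scale * x + tx, y' = scale * y + ty to silhouette pixel coordinates."""
    return [(scale * x + tx, scale * y + ty) for x, y in points]


head, ankle_mid = (10.0, 20.0), (10.0, 60.0)
scale, tx, ty = alignment_from_skeleton(head, ankle_mid, target_height=20.0,
                                        target_center=(32.0, 32.0))
aligned = apply_affine([head, ankle_mid], scale, tx, ty)
print(scale, aligned)
```

Applying the same transform to every silhouette pixel in a frame places subjects at a consistent scale and position, which targets the distance and position inconsistencies named in the abstract. Handling varying viewing angles would need more than this restricted map.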

arXiv information

Authors	Zhengxian Wu, Chuanrui Zhang, Hangrui Xu, Peng Jiao, Haoqian Wang
Published	2025-03-24 16:08:21+00:00
arXiv site	arxiv_id(pdf)

Category: cs.CV

Dual-domain Multi-path Self-supervised Diffusion Model for Accelerated MRI Reconstruction

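Posted: March 25, 2025 | Author: jarxiv

Summary

Magnetic resonance imaging (MRI) is a vital diagnostic tool, but its inherently long acquisition times reduce clinical efficiency and patient comfort.
Recent advances in deep learning, particularly diffusion models, have improved accelerated MRI reconstruction.
However, training existing diffusion models often relies on fully sampled data, and the models incur high computational costs and often lack uncertainty estimation, which limits their clinical applicability.
To overcome these challenges, the authors propose a novel framework called Dual-domain Multi-path Self-supervised Diffusion Model (DMSM), which integrates a self-supervised dual-domain diffusion model training scheme, a lightweight hybrid attention network for the reconstruction diffusion model, and a multi-path inference strategy to enhance reconstruction accuracy, efficiency, and explainability.
Unlike traditional diffusion-based models, DMSM eliminates the dependency on training from fully sampled data, making it more practical for real-world clinical settings.
DMSM was evaluated on two human MRI datasets, where it achieved favorable performance over several supervised and self-supervised baselines, particularly in preserving fine anatomical structures and suppressing artifacts under high acceleration factors.
Additionally, the model generates uncertainty maps that correlate reasonably well with reconstruction errors, offering valuable clinically interpretable guidance and potentially enhancing diagnostic confidence.

Summary (original)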

Magnetic resonance imaging (MRI) is a vital diagnostic tool, but its inherently long acquisition times reduce clinical efficiency and patient comfort. Recent advancements in deep learning, particularly diffusion models, have improved accelerated MRI reconstruction. However, existing diffusion models’ training often relies on fully sampled data, models incur high computational costs, and often lack uncertainty estimation, limiting their clinical applicability. To overcome these challenges, we propose a novel framework, called Dual-domain Multi-path Self-supervised Diffusion Model (DMSM), that integrates a self-supervised dual-domain diffusion model training scheme, a lightweight hybrid attention network for the reconstruction diffusion model, and a multi-path inference strategy, to enhance reconstruction accuracy, efficiency, and explainability. Unlike traditional diffusion-based models, DMSM eliminates the dependency on training from fully sampled data, making it more practical for real-world clinical settings. We evaluated DMSM on two human MRI datasets, demonstrating that it achieves favorable performance over several supervised and self-supervised baselines, particularly in preserving fine anatomical structures and suppressing artifacts under high acceleration factors. Additionally, our model generates uncertainty maps that correlate reasonably well with reconstruction errors, offering valuable clinically interpretable guidance and potentially enhancing diagnostic confidence.
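The abstract does not say how the uncertainty maps are derived from the multi-path inference strategy. One plausible reading, assumed here for illustration only, is that several inference paths yield several reconstructions of the same slice, and their pixel-wise spread serves as the uncertainty map. The function name and the use of the sample standard deviation are assumptions.

```python
import math
from typing import List, Tuple

Image = List[List[float]]  # H x W reconstruction (magnitude values)


def pixelwise_mean_std(recons: List[Image]) -> Tuple[Image, Image]:
    """Combine reconstructions into a mean image and a sample-std uncertainty map."""
    n = len(recons)
    if n < 2:
        raise ValueError("need at least two reconstructions to estimate spread")
    h, w = len(recons[0]), len(recons[0][0])
    mean = [[sum(r[y][x] for r in recons) / n for x in range(w)] for y in range(h)]
    std = [[math.sqrt(sum((r[y][x] - mean[y][x]) ** 2 for r in recons) / (n - 1))
            for x in range(w)] for y in range(h)]
    return mean, std


# Two toy 1 x 2 reconstructions that disagree on the first pixel only.
path_a = [[1.0, 2.0]]
path_b = [[3.0, 2.0]]
mean_img, uncertainty = pixelwise_mean_std([path_a, path_b])
print(mean_img, uncertainty)
```

Pixels where the paths disagree receive high values in the map, which is the kind of signal that could then be compared against reconstruction error.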

arXiv information

Authors	Yuxuan Zhang, Jinkui Hao, Bo Zhou
Published	2025-03-24 16:10:51+00:00
arXiv site	arxiv_id(pdf)

Category: cs.AI, cs.CV, eess.IV

Learning to segment anatomy and lesions from disparately labeled sources in brain MRI

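Posted: March 25, 2025 | Author: jarxiv

Summary

Segmenting healthy tissue structures alongside lesions in brain magnetic resonance images (MRI) remains a challenge for today's algorithms because lesions disrupt the anatomy and because jointly labeled training datasets, in which both healthy tissues and lesions are labeled on the same images, are lacking.
This paper proposes a method that is robust to lesion-caused disruptions and can be trained from disparately labeled training sets, i.e., without requiring jointly labeled samples, to automatically segment both.
In contrast to prior work, healthy tissue and lesion segmentation are decoupled into two paths to leverage multi-sequence acquisitions, and information is merged with an attention mechanism.
During inference, an image-specific adaptation reduces the adverse influence of lesion regions on healthy tissue predictions.
During training, the adaptation is taken into account through meta-learning, and co-training is used to learn from disparately labeled training images.
The model shows improved performance on several anatomical structures and lesions on a publicly available brain glioblastoma dataset compared to state-of-the-art segmentation methods.

Summary (original)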

Segmenting healthy tissue structures alongside lesions in brain Magnetic Resonance Images (MRI) remains a challenge for today’s algorithms due to lesion-caused disruption of the anatomy and lack of jointly labeled training datasets, where both healthy tissues and lesions are labeled on the same images. In this paper, we propose a method that is robust to lesion-caused disruptions and can be trained from disparately labeled training sets, i.e., without requiring jointly labeled samples, to automatically segment both. In contrast to prior work, we decouple healthy tissue and lesion segmentation in two paths to leverage multi-sequence acquisitions and merge information with an attention mechanism. During inference, an image-specific adaptation reduces adverse influences of lesion regions on healthy tissue predictions. During training, the adaptation is taken into account through meta-learning and co-training is used to learn from disparately labeled training images. Our model shows an improved performance on several anatomical structures and lesions on a publicly available brain glioblastoma dataset compared to the state-of-the-art segmentation methods.

arXiv information

Authors	Meva Himmetoglu, Ilja Ciernik, Ender Konukoglu
Published	2025-03-24 16:13:04+00:00
arXiv site	arxiv_id(pdf)

Category: cs.CV, eess.IV

Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment

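Posted: March 25, 2025 | Author: jarxiv

Summary

Many real-world user queries (e.g., "How to make egg fried rice?") could benefit from systems capable of generating responses with both textual steps and accompanying images, similar to a cookbook.
Models designed to generate interleaved text and images face challenges in ensuring consistency within and across these modalities.
To address these challenges, the authors present ISG, a comprehensive evaluation framework for interleaved text-and-image generation.
ISG leverages a scene graph structure to capture relationships between text and image blocks and evaluates responses at four levels of granularity: holistic, structural, block-level, and image-specific.
This multi-tiered evaluation allows a nuanced assessment of consistency, coherence, and accuracy, and provides interpretable question-answer feedback.
In conjunction with ISG, they introduce ISG-Bench, a benchmark of 1,150 samples across 8 categories and 21 subcategories.
The benchmark dataset includes complex language-vision dependencies and golden answers for evaluating models effectively on vision-centric tasks such as style transfer, a challenging area for current models.
Using ISG-Bench, they demonstrate that recent unified vision-language models perform poorly at generating interleaved content.
Compositional approaches that combine separate language and image models show a 111% improvement over unified models at the holistic level, but their performance remains suboptimal at both the block and image levels.
To facilitate future work, they develop ISG-Agent, a baseline agent that uses a "plan-execute-refine" pipeline to invoke tools and achieves a 122% performance improvement.

Summary (original)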

Many real-world user queries (e.g. ‘How to make egg fried rice?’) could benefit from systems capable of generating responses with both textual steps and accompanying images, similar to a cookbook. Models designed to generate interleaved text and images face challenges in ensuring consistency within and across these modalities. To address these challenges, we present ISG, a comprehensive evaluation framework for interleaved text-and-image generation. ISG leverages a scene graph structure to capture relationships between text and image blocks, evaluating responses on four levels of granularity: holistic, structural, block-level, and image-specific. This multi-tiered evaluation allows for a nuanced assessment of consistency, coherence, and accuracy, and provides interpretable question-answer feedback. In conjunction with ISG, we introduce a benchmark, ISG-Bench, encompassing 1,150 samples across 8 categories and 21 subcategories. This benchmark dataset includes complex language-vision dependencies and golden answers to evaluate models effectively on vision-centric tasks such as style transfer, a challenging area for current models. Using ISG-Bench, we demonstrate that recent unified vision-language models perform poorly on generating interleaved content. While compositional approaches that combine separate language and image models show a 111% improvement over unified models at the holistic level, their performance remains suboptimal at both block and image levels. To facilitate future work, we develop ISG-Agent, a baseline agent employing a ‘plan-execute-refine’ pipeline to invoke tools, achieving a 122% performance improvement.
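Percentage improvements above 100% are easy to misread, so a short worked calculation may help. Assuming the figures are relative improvements over a baseline score (the abstract does not define them further), a 111% improvement means the new score is 2.11 times the baseline. The baseline value of 1.0 used below is hypothetical and not taken from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Percentage improvement of `new` over `baseline`."""
    if baseline <= 0.0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline * 100.0


def score_after_improvement(baseline: float, percent: float) -> float:
    """Score implied by a given percentage improvement over `baseline`."""
    return baseline * (1.0 + percent / 100.0)


hypothetical_baseline = 1.0  # illustrative value only
compositional = score_after_improvement(hypothetical_baseline, 111.0)
agent = score_after_improvement(hypothetical_baseline, 122.0)
print(compositional, agent)
```

Under this reading, the two reported gains correspond to roughly 2.1 and 2.2 times the baseline score. The abstract does not state which baseline the 122% figure refers to.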

arXiv information

Authors	Dongping Chen, Ruoxi Chen, Shu Pu, Zhaoyi Liu, Yanru Wu, Caixi Chen, Benlin Liu, Yue Huang, Yao Wan, Pan Zhou, Ranjay Krishna
Published	2025-03-24 16:16:20+00:00
arXiv site	arxiv_id(pdf)

Category: cs.CL, cs.CV
