jarxiv | Japanese arxiv | Page 1766

Foundation Models in Computational Pathology: A Review of Challenges, Opportunities, and Impact

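Posted: February 13, 2025 | Author: jarxiv

Summary

Computational pathology has evolved rapidly in recent years, from self-supervised, vision-only models to contrastive vision-language frameworks.
Generative AI 'co-pilots' now show the ability to mine subtle, sub-visual tissue cues across the cellular-to-pathology spectrum, generate comprehensive reports, and respond to complex user queries.
The scale of data has surged from tens to millions of multi-gigapixel tissue images, while the number of trainable parameters in these models has grown to several billion.
A critical question remains: how will this new wave of generative, multi-purpose AI transform clinical diagnostics?
This article explores the true potential of these innovations and their integration into clinical practice.
It reviews the rapid progress of foundation models in pathology and clarifies their applications and significance.
More precisely, it examines the very definition of a foundation model, identifies what makes a model foundational, general, or multi-purpose, and assesses the impact on computational pathology.
It also addresses the unique challenges associated with developing and evaluating these models.
Although these models have demonstrated exceptional predictive and generative capabilities, establishing global benchmarks is crucial for strengthening evaluation standards and fostering widespread clinical adoption.
In computational pathology, the broader impact of frontier AI ultimately depends on widespread adoption and societal acceptance.
Direct public exposure is not strictly necessary, but it remains a powerful tool for dispelling misconceptions, building trust, and securing regulatory support.

Abstract (original)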

From self-supervised, vision-only models to contrastive visual-language frameworks, computational pathology has rapidly evolved in recent years. Generative AI ‘co-pilots’ now demonstrate the ability to mine subtle, sub-visual tissue cues across the cellular-to-pathology spectrum, generate comprehensive reports, and respond to complex user queries. The scale of data has surged dramatically, growing from tens to millions of multi-gigapixel tissue images, while the number of trainable parameters in these models has risen to several billion. The critical question remains: how will this new wave of generative and multi-purpose AI transform clinical diagnostics? In this article, we explore the true potential of these innovations and their integration into clinical practice. We review the rapid progress of foundation models in pathology, clarify their applications and significance. More precisely, we examine the very definition of foundational models, identifying what makes them foundational, general, or multipurpose, and assess their impact on computational pathology. Additionally, we address the unique challenges associated with their development and evaluation. These models have demonstrated exceptional predictive and generative capabilities, but establishing global benchmarks is crucial to enhancing evaluation standards and fostering their widespread clinical adoption. In computational pathology, the broader impact of frontier AI ultimately depends on widespread adoption and societal acceptance. While direct public exposure is not strictly necessary, it remains a powerful tool for dispelling misconceptions, building trust, and securing regulatory support.

arXiv information

Authors	Mohsin Bilal, Aadam, Manahil Raza, Youssef Altherwy, Anas Alsuhaibani, Abdulrahman Abduljabbar, Fahdah Almarshad, Paul Golding, Nasir Rajpoot
Published	2025-02-12 11:57:11+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Category: cs.CV

Hi-End-MAE: Hierarchical encoder-driven masked autoencoders are stronger vision learners for medical image segmentation

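Posted: February 13, 2025 | Author: jarxiv

Summary

Medical image segmentation remains a formidable challenge because labels are scarce.
Pre-training a Vision Transformer (ViT) through masked image modeling (MIM) on large-scale unlabeled medical datasets is a promising solution, offering both computational efficiency and model generalization across downstream tasks.
However, current ViT-based MIM pre-training frameworks mainly emphasize locally aggregated representations in the output layers and fail to exploit the rich representations across different ViT layers, which better capture the fine-grained semantic information needed for more precise medical downstream tasks.
To fill this gap, the authors present Hierarchical Encoder-driven MAE (Hi-End-MAE), a simple yet effective ViT-based pre-training solution built around two key innovations: (1) encoder-driven reconstruction, which encourages the encoder to learn more informative features to guide the reconstruction of masked patches; and (2) hierarchical dense decoding, which implements a hierarchical decoding structure to capture rich representations across different layers.
Hi-End-MAE is pre-trained on a large-scale dataset of 10K CT scans and evaluated on seven public medical image segmentation benchmarks.
Extensive experiments show that Hi-End-MAE achieves superior transfer learning across a variety of downstream tasks, revealing the potential of ViT in medical imaging applications.
The code is available at https://github.com/FengheTan9/Hi-End-MAE

Abstract (original)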

Medical image segmentation remains a formidable challenge due to the label scarcity. Pre-training Vision Transformer (ViT) through masked image modeling (MIM) on large-scale unlabeled medical datasets presents a promising solution, providing both computational efficiency and model generalization for various downstream tasks. However, current ViT-based MIM pre-training frameworks predominantly emphasize local aggregation representations in output layers and fail to exploit the rich representations across different ViT layers that better capture fine-grained semantic information needed for more precise medical downstream tasks. To fill the above gap, we hereby present Hierarchical Encoder-driven MAE (Hi-End-MAE), a simple yet effective ViT-based pre-training solution, which centers on two key innovations: (1) Encoder-driven reconstruction, which encourages the encoder to learn more informative features to guide the reconstruction of masked patches; and (2) Hierarchical dense decoding, which implements a hierarchical decoding structure to capture rich representations across different layers. We pre-train Hi-End-MAE on a large-scale dataset of 10K CT scans and evaluated its performance across seven public medical image segmentation benchmarks. Extensive experiments demonstrate that Hi-End-MAE achieves superior transfer learning capabilities across various downstream tasks, revealing the potential of ViT in medical imaging applications. The code is available at: https://github.com/FengheTan9/Hi-End-MAE
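
The masked image modeling step mentioned in the abstract can be illustrated with a minimal sketch of random patch masking. This is not the Hi-End-MAE implementation: the function name `random_patch_mask`, the 0.75 mask ratio (a commonly used MAE default), and the 2D patch grid are all assumptions made for illustration, and the abstract itself does not state them.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into (visible, masked) sets for masked image modeling.

    Hypothetical helper: the mask ratio and the seeding scheme are assumptions,
    not values taken from the Hi-End-MAE paper.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])    # patches the decoder must reconstruct
    visible = sorted(indices[num_masked:])   # patches the encoder actually sees
    return visible, masked


# A 224x224 image cut into 16x16 patches gives 14 * 14 = 196 patches.
visible, masked = random_patch_mask(num_patches=196, mask_ratio=0.75)
print(len(visible), len(masked))  # prints "49 147"
```

Only the visible patches would be passed to the encoder; the reconstruction loss is then computed on the masked ones.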

arXiv information

Authors	Fenghe Tang, Qingsong Yao, Wenxin Ma, Chenxu Wu, Zihang Jiang, S. Kevin Zhou
Published	2025-02-12 12:14:02+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Category: cs.CV

Sat-DN: Implicit Surface Reconstruction from Multi-View Satellite Images with Depth and Normal Supervision

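Posted: February 13, 2025 | Author: jarxiv

Summary

Advances in satellite imaging technology have made high-resolution multi-view satellite imagery increasingly accessible, enabling rapid, location-independent reconstruction of ground models.
However, traditional stereo matching methods struggle to capture fine details, and although neural radiance fields (NeRFs) achieve high-quality reconstructions, their training time is prohibitively long.
Moreover, challenges such as the low visibility of building facades, illumination and style differences between pixels, and weakly textured regions in satellite imagery make it even harder to reconstruct reasonable terrain geometry and detailed building facades.
To address these issues, the authors propose Sat-DN, a novel framework that leverages a progressively trained multi-resolution hash grid reconstruction architecture with explicit depth guidance and surface normal consistency constraints to enhance reconstruction quality.
The multi-resolution hash grid accelerates training, while the progressive strategy incrementally increases the learning frequency, using coarse low-frequency geometry to guide the reconstruction of fine high-frequency details.
The depth and normal constraints ensure clear building outlines and a correct planar distribution.
Extensive experiments on the DFC2019 dataset show that Sat-DN outperforms existing methods, achieving state-of-the-art results in both qualitative and quantitative evaluations.
The code is available at https://github.com/costune/SatDN.

Abstract (original)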

With advancements in satellite imaging technology, acquiring high-resolution multi-view satellite imagery has become increasingly accessible, enabling rapid and location-independent ground model reconstruction. However, traditional stereo matching methods struggle to capture fine details, and while neural radiance fields (NeRFs) achieve high-quality reconstructions, their training time is prohibitively long. Moreover, challenges such as low visibility of building facades, illumination and style differences between pixels, and weakly textured regions in satellite imagery further make it hard to reconstruct reasonable terrain geometry and detailed building facades. To address these issues, we propose Sat-DN, a novel framework leveraging a progressively trained multi-resolution hash grid reconstruction architecture with explicit depth guidance and surface normal consistency constraints to enhance reconstruction quality. The multi-resolution hash grid accelerates training, while the progressive strategy incrementally increases the learning frequency, using coarse low-frequency geometry to guide the reconstruction of fine high-frequency details. The depth and normal constraints ensure a clear building outline and correct planar distribution. Extensive experiments on the DFC2019 dataset demonstrate that Sat-DN outperforms existing methods, achieving state-of-the-art results in both qualitative and quantitative evaluations. The code is available at https://github.com/costune/SatDN.
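
The coarse-to-fine idea behind a progressively trained multi-resolution grid can be sketched as follows. This is an illustration under stated assumptions, not Sat-DN's schedule: the geometric spacing of resolutions, the function names, and the `start_levels` and `steps_per_level` parameters are hypothetical choices that do not come from the abstract.

```python
import math


def level_resolutions(n_min, n_max, num_levels):
    """Grid resolutions spaced geometrically between n_min and n_max.

    Assumption: geometric spacing with rounding, a common choice for
    multi-resolution grids; the paper's exact values are not given here.
    """
    if num_levels == 1:
        return [n_min]
    growth = math.exp((math.log(n_max) - math.log(n_min)) / (num_levels - 1))
    return [int(round(n_min * growth ** level)) for level in range(num_levels)]


def active_levels(step, num_levels, start_levels=4, steps_per_level=1000):
    """Number of grid levels enabled at a training step, coarsest first.

    Hypothetical schedule: begin with a few coarse levels and unlock one
    finer level every `steps_per_level` steps.
    """
    return min(num_levels, start_levels + step // steps_per_level)


resolutions = level_resolutions(n_min=16, n_max=2048, num_levels=8)
for step in (0, 2500, 10000):
    enabled = active_levels(step, num_levels=len(resolutions))
    print(step, resolutions[:enabled])
```

Features from levels that are not yet enabled would simply be zeroed out, so early training fits only low-frequency geometry and later steps add finer detail.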

arXiv information

Authors	Tianle Liu, Shuangming Zhao, Wanshou Jiang, Bingxuan Guo
Published	2025-02-12 12:27:32+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Category: cs.CV

Uncertainty Aware Human-machine Collaboration in Camouflaged Object Detection

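Posted: February 13, 2025 | Author: jarxiv

Summary

Camouflaged object detection (COD), the task of identifying objects concealed within their environments, has grown rapidly because of its wide range of practical applications.
A key step toward trustworthy COD systems is estimating uncertainty and using it effectively.
This work proposes a human-machine collaboration framework for classifying the presence of camouflaged objects, leveraging the complementary strengths of computer vision (CV) models and non-invasive brain-computer interfaces (BCIs).
The approach introduces a multiview backbone to estimate the uncertainty of CV model predictions, uses this uncertainty during training to improve efficiency, and at test time defers low-confidence cases to human evaluation via RSVP-based BCIs for more reliable decision-making.
Evaluated on the CAMO dataset, the framework achieves state-of-the-art results, with average improvements of 4.56% in balanced accuracy (BA) and 3.66% in F1 score over existing methods.
For the best-performing participants, the improvements reach 7.6% in BA and 6.66% in F1 score.
Analysis of the training process reveals a strong correlation between the confidence measures and precision, and an ablation study confirms the effectiveness of the proposed training policy and the human-machine collaboration strategy.
Overall, this work reduces human cognitive load, improves system reliability, and provides a strong foundation for advances in real-world COD applications and human-computer interaction.
The code and data are available at https://github.com/ziyuey/Uncertainty-aware-human-machine-collaboration-in-camouflaged-object-identification.

Abstract (original)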

Camouflaged Object Detection (COD), the task of identifying objects concealed within their environments, has seen rapid growth due to its wide range of practical applications. A key step toward developing trustworthy COD systems is the estimation and effective utilization of uncertainty. In this work, we propose a human-machine collaboration framework for classifying the presence of camouflaged objects, leveraging the complementary strengths of computer vision (CV) models and noninvasive brain-computer interfaces (BCIs). Our approach introduces a multiview backbone to estimate uncertainty in CV model predictions, utilizes this uncertainty during training to improve efficiency, and defers low-confidence cases to human evaluation via RSVP-based BCIs during testing for more reliable decision-making. We evaluated the framework in the CAMO dataset, achieving state-of-the-art results with an average improvement of 4.56% in balanced accuracy (BA) and 3.66% in the F1 score compared to existing methods. For the best-performing participants, the improvements reached 7.6% in BA and 6.66% in the F1 score. Analysis of the training process revealed a strong correlation between our confidence measures and precision, while an ablation study confirmed the effectiveness of the proposed training policy and the human-machine collaboration strategy. In general, this work reduces human cognitive load, improves system reliability, and provides a strong foundation for advancements in real-world COD applications and human-computer interaction. Our code and data are available at: https://github.com/ziyuey/Uncertainty-aware-human-machine-collaboration-in-camouflaged-object-identification.
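
The deferral idea, together with the two reported metrics, can be sketched in a few lines. This is a toy illustration under stated assumptions: confidence is taken as `max(p, 1 - p)` from a single predicted probability rather than from the paper's multiview backbone, the 0.8 threshold is hypothetical, and the labels below are made-up values, not results from the CAMO experiments.

```python
def collaborate(machine_probs, human_labels, threshold=0.8):
    """Keep the model's label when it is confident, otherwise defer to the human.

    Assumption: confidence = max(p, 1 - p); the threshold is hypothetical.
    """
    decisions, deferred = [], 0
    for prob, human in zip(machine_probs, human_labels):
        if max(prob, 1.0 - prob) >= threshold:
            decisions.append(1 if prob >= 0.5 else 0)
        else:
            decisions.append(human)
            deferred += 1
    return decisions, deferred


def confusion(y_true, y_pred):
    """Return (tp, fp, tn, fn) counts for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def balanced_accuracy(y_true, y_pred):
    """Mean of the true-positive rate and the true-negative rate."""
    tp, fp, tn, fn = confusion(y_true, y_pred)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return 0.5 * (tpr + tnr)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    tp, fp, tn, fn = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data: the model is unsure about items 2 and 3, which go to the human.
y_true = [1, 0, 1, 0, 1, 0]
machine_probs = [0.95, 0.10, 0.55, 0.60, 0.90, 0.05]
human_labels = [0, 0, 1, 0, 1, 0]  # the human errs on item 0, which is never deferred

machine_only = [1 if p >= 0.5 else 0 for p in machine_probs]
combined, deferred = collaborate(machine_probs, human_labels)
print(deferred, balanced_accuracy(y_true, machine_only), balanced_accuracy(y_true, combined))
```

Because only low-confidence items are handed over, the human reviews two of six cases here, which is the sense in which such a scheme can reduce cognitive load.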

arXiv information

Authors	Ziyue Yang, Kehan Wang, Yuhang Ming, Yong Peng, Han Yang, Qiong Chen, Wanzeng Kong
Published	2025-02-12 13:05:24+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Category: cs.AI, cs.CV

AdvSwap: Covert Adversarial Perturbation with High Frequency Info-swapping for Autonomous Driving Perception

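Posted: February 13, 2025 | Author: jarxiv

Summary

The perception modules of autonomous vehicles (AVs) are increasingly susceptible to attacks that exploit neural network vulnerabilities through adversarial inputs, thereby compromising AI safety.
Some studies focus on creating covert adversarial samples, but existing global noise techniques are detectable and struggle to deceive the human visual system.
This paper introduces AdvSwap, a novel adversarial attack method that uses wavelet-based high-frequency information swapping to generate covert adversarial samples and fool the camera.
AdvSwap employs an invertible neural network for selective high-frequency information swapping, preserving both forward propagation and data integrity.
The scheme effectively removes the original label data and incorporates guidance image data, producing concealed and robust adversarial samples.
Experimental evaluations and comparisons on the GTSRB and nuScenes datasets show that AdvSwap can mount concealed attacks on common traffic targets.
The generated adversarial samples are also difficult for both humans and algorithms to perceive.
The method also shows strong attack robustness and attack transferability.

Abstract (original)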

Perception module of Autonomous vehicles (AVs) are increasingly susceptible to be attacked, which exploit vulnerabilities in neural networks through adversarial inputs, thereby compromising the AI safety. Some researches focus on creating covert adversarial samples, but existing global noise techniques are detectable and difficult to deceive the human visual system. This paper introduces a novel adversarial attack method, AdvSwap, which creatively utilizes wavelet-based high-frequency information swapping to generate covert adversarial samples and fool the camera. AdvSwap employs invertible neural network for selective high-frequency information swapping, preserving both forward propagation and data integrity. The scheme effectively removes the original label data and incorporates the guidance image data, producing concealed and robust adversarial samples. Experimental evaluations and comparisons on the GTSRB and nuScenes datasets demonstrate that AdvSwap can make concealed attacks on common traffic targets. The generates adversarial samples are also difficult to perceive by humans and algorithms. Meanwhile, the method has strong attacking robustness and attacking transferability.
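
What "wavelet-based high-frequency information swapping" means at the signal level can be shown with a one-level Haar transform on 1-D lists. This sketch is deliberately much simpler than AdvSwap: it uses a fixed Haar wavelet instead of an invertible neural network, operates on toy 1-D signals instead of images, and contains no attack objective. All function names are hypothetical.

```python
import math


def haar_forward(signal):
    """One-level Haar transform: returns (approximation, detail) coefficients."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Exact inverse of haar_forward."""
    s = math.sqrt(2.0)
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([(a + d) / s, (a - d) / s])
    return signal


def swap_high_frequency(source, guide):
    """Keep the source's low-frequency band and take the guide's high-frequency band."""
    source_approx, _ = haar_forward(source)
    _, guide_detail = haar_forward(guide)
    return haar_inverse(source_approx, guide_detail)


source = [4.0, 4.0, 8.0, 8.0]   # smooth signal: all detail coefficients are zero
guide = [1.0, -1.0, 2.0, -2.0]  # oscillating signal: carries the high-frequency content
mixed = swap_high_frequency(source, guide)
print([round(x, 6) for x in mixed])
```

The pairwise averages of `mixed` match those of `source`, so the coarse content is preserved while the fine detail comes from `guide`.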

arXiv information

Authors	Yuanhao Huang, Qinfan Zhang, Jiandong Xing, Mengyue Cheng, Haiyang Yu, Yilong Ren, Xiao Xiong
Published	2025-02-12 13:05:35+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Category: cs.CV

Not All Frame Features Are Equal: Video-to-4D Generation via Decoupling Dynamic-Static Features

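Posted: February 13, 2025 | Author: jarxiv

Summary

Recently, generating dynamic 3D objects from video has shown impressive results.
Existing methods directly optimize Gaussians using all of the information in the frames.
However, when dynamic regions are interwoven with static regions within a frame, particularly when static regions account for a large proportion, existing methods tend to overlook information in the dynamic regions and overfit to the static regions.
This produces results with blurry textures.
The authors argue that decoupling dynamic and static features to enhance dynamic representations can alleviate this issue.
They therefore propose a dynamic-static feature decoupling module (DSFD).
Along the temporal axis, it treats the portions of the current frame features that differ significantly from the reference frame features as dynamic features.
Conversely, the remaining portions are treated as static features.
Decoupled features are then obtained, driven by the dynamic features and the current frame features.
In addition, to further enhance the dynamic representation of the decoupled features from different viewpoints and to ensure accurate motion prediction, the authors design a temporal-spatial similarity fusion module (TSSF).
Along the spatial axis, it adaptively selects similar information from dynamic regions.
Building on these components, they construct a novel approach, DS4D.
Experimental results verify that the method achieves state-of-the-art (SOTA) results in video-to-4D generation.
Experiments on a real-world scenario dataset further demonstrate its effectiveness on 4D scenes.
The code will be made publicly available.

Abstract (original)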

Recently, the generation of dynamic 3D objects from a video has shown impressive results. Existing methods directly optimize Gaussians using whole information in frames. However, when dynamic regions are interwoven with static regions within frames, particularly if the static regions account for a large proportion, existing methods often overlook information in dynamic regions and are prone to overfitting on static regions. This leads to producing results with blurry textures. We consider that decoupling dynamic-static features to enhance dynamic representations can alleviate this issue. Thus, we propose a dynamic-static feature decoupling module (DSFD). Along temporal axes, it regards the portions of current frame features that possess significant differences relative to reference frame features as dynamic features. Conversely, the remaining parts are the static features. Then, we acquire decoupled features driven by dynamic features and current frame features. Moreover, to further enhance the dynamic representation of decoupled features from different viewpoints and ensure accurate motion prediction, we design a temporal-spatial similarity fusion module (TSSF). Along spatial axes, it adaptively selects a similar information of dynamic regions. Hinging on the above, we construct a novel approach, DS4D. Experimental results verify our method achieves state-of-the-art (SOTA) results in video-to-4D. In addition, the experiments on a real-world scenario dataset demonstrate its effectiveness on the 4D scene. Our code will be publicly available.
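
The decoupling rule described for DSFD (features that differ significantly from the reference frame count as dynamic, the rest as static) can be sketched with a hard threshold. This is an assumption-heavy illustration: the actual module is presumably learned, and the function name, the element-wise absolute difference, and the 0.5 threshold are hypothetical.

```python
def decouple_features(current, reference, threshold=0.5):
    """Split current-frame features into dynamic and static parts.

    Hypothetical rule: an element is dynamic when it differs from the
    reference-frame feature by more than `threshold`.
    """
    dynamic, static = [], []
    for cur, ref in zip(current, reference):
        is_dynamic = abs(cur - ref) > threshold
        dynamic.append(cur if is_dynamic else 0.0)
        static.append(0.0 if is_dynamic else cur)
    return dynamic, static


current = [1.0, 2.0, 0.2, 3.0]
reference = [1.1, 0.5, 0.2, 3.9]
dynamic, static = decouple_features(current, reference)
print(dynamic)  # elements 1 and 3 changed by more than the threshold
print(static)   # elements 0 and 2 stayed close to the reference
```

The two parts sum back to the current-frame features, so no information is discarded by the split itself.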

arXiv information

Authors	Liying Yang, Chen Liu, Zhenwei Zhu, Ajian Liu, Hui Ma, Jian Nong, Yanyan Liang
Published	2025-02-12 13:08:35+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Category: cs.CV

Robust Visual Representation Learning with Multi-modal Prior Knowledge for Image Classification Under Distribution Shift

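Posted: February 13, 2025 | Author: jarxiv

Summary

Despite the remarkable success of deep neural networks (DNNs) in computer vision, they fail to remain high-performing when facing distribution shifts between training and test data.
This paper proposes Knowledge-Guided Visual representation learning (KGV), a distribution-based learning approach that leverages multi-modal prior knowledge to improve generalization under distribution shift.
It integrates knowledge from two distinct modalities: 1) a knowledge graph (KG) with hierarchical and association relationships; and 2) generated synthetic images of visual elements that are semantically represented in the KG.
The respective embeddings are generated from the given modalities in a common latent space: visual embeddings from original and synthetic images, and knowledge graph embeddings (KGEs).
These embeddings are aligned via a novel variant of translation-based KGE methods, in which the node and relation embeddings of the KG are modeled as Gaussian distributions and translations, respectively.
The authors claim that incorporating multi-modal prior knowledge enables more regularized learning of image representations.
As a result, the models generalize better across different data distributions.
KGV is evaluated on image classification tasks with major or minor distribution shifts: road sign classification across datasets from Germany, China, and Russia, and image classification with the mini-ImageNet dataset and its variants as well as the DVM-CAR dataset.
The results show that KGV consistently achieves higher accuracy and data efficiency across all experiments.

Abstract (original)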

Despite the remarkable success of deep neural networks (DNNs) in computer vision, they fail to remain high-performing when facing distribution shifts between training and testing data. In this paper, we propose Knowledge-Guided Visual representation learning (KGV) – a distribution-based learning approach leveraging multi-modal prior knowledge – to improve generalization under distribution shift. It integrates knowledge from two distinct modalities: 1) a knowledge graph (KG) with hierarchical and association relationships; and 2) generated synthetic images of visual elements semantically represented in the KG. The respective embeddings are generated from the given modalities in a common latent space, i.e., visual embeddings from original and synthetic images as well as knowledge graph embeddings (KGEs). These embeddings are aligned via a novel variant of translation-based KGE methods, where the node and relation embeddings of the KG are modeled as Gaussian distributions and translations, respectively. We claim that incorporating multi-model prior knowledge enables more regularized learning of image representations. Thus, the models are able to better generalize across different data distributions. We evaluate KGV on different image classification tasks with major or minor distribution shifts, namely road sign classification across datasets from Germany, China, and Russia, image classification with the mini-ImageNet dataset and its variants, as well as the DVM-CAR dataset. The results demonstrate that KGV consistently exhibits higher accuracy and data efficiency across all experiments.

arXiv information

Authors	Hongkuan Zhou, Lavdim Halilaj, Sebastian Monka, Stefan Schmid, Yuqicheng Zhu, Bo Xiong, Steffen Staab
Published	2025-02-12 13:22:54+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Category: cs.CV, cs.LG

Gramian Multimodal Representation Learning and Alignment

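Posted: February 13, 2025 | Author: jarxiv

Summary

Human perception integrates multiple modalities, such as vision, hearing, and language, into a unified understanding of the surrounding reality.
Recent multimodal models have made significant progress by aligning pairs of modalities via contrastive learning, but their solutions are unsuitable when scaling to multiple modalities.
These models typically align each modality to a designated anchor without ensuring that all modalities are aligned with each other, leading to suboptimal performance on tasks that require a joint understanding of multiple modalities.
This paper structurally rethinks the conventional pairwise approach to multimodal learning and presents the novel Gramian Representation Alignment Measure (GRAM), which overcomes these limitations.
GRAM learns and then aligns $n$ modalities directly in the higher-dimensional space in which the modality embeddings lie, by minimizing the Gramian volume of the $k$-dimensional parallelotope spanned by the modality vectors, ensuring the geometric alignment of all modalities simultaneously.
GRAM can replace cosine similarity in any downstream method, holds for 2 to $n$ modalities, and provides more meaningful alignment than previous similarity measures.
The novel GRAM-based contrastive loss function enhances the alignment of multimodal models in the higher-dimensional embedding space, leading to new state-of-the-art performance on downstream tasks such as video-audio-text retrieval and audio-video classification.
The project page, code, and pretrained models are available at https://ispamm.github.io/GRAM/.

Abstract (original)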

Human perception integrates multiple modalities, such as vision, hearing, and language, into a unified understanding of the surrounding reality. While recent multimodal models have achieved significant progress by aligning pairs of modalities via contrastive learning, their solutions are unsuitable when scaling to multiple modalities. These models typically align each modality to a designated anchor without ensuring the alignment of all modalities with each other, leading to suboptimal performance in tasks requiring a joint understanding of multiple modalities. In this paper, we structurally rethink the pairwise conventional approach to multimodal learning and we present the novel Gramian Representation Alignment Measure (GRAM), which overcomes the above-mentioned limitations. GRAM learns and then aligns $n$ modalities directly in the higher-dimensional space in which modality embeddings lie by minimizing the Gramian volume of the $k$-dimensional parallelotope spanned by the modality vectors, ensuring the geometric alignment of all modalities simultaneously. GRAM can replace cosine similarity in any downstream method, holding for 2 to $n$ modalities and providing more meaningful alignment with respect to previous similarity measures. The novel GRAM-based contrastive loss function enhances the alignment of multimodal models in the higher-dimensional embedding space, leading to new state-of-the-art performance in downstream tasks such as video-audio-text retrieval and audio-video classification. The project page, the code, and the pretrained models are available at https://ispamm.github.io/GRAM/.
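
The quantity at the center of the abstract, the volume of the parallelotope spanned by the modality vectors, equals the square root of the determinant of their Gram matrix. A minimal sketch of that computation follows. Normalizing each embedding to unit length first is an assumption made here so the volume lies in [0, 1]; the function names are hypothetical, and this is not the authors' code or loss function.

```python
import math


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def gram_matrix(vectors):
    """Matrix of pairwise inner products G[i][j] = <v_i, v_j>."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # numerically singular: the vectors are linearly dependent
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n):
                m[row][k] -= factor * m[col][k]
    return det


def gramian_volume(vectors):
    """Volume of the parallelotope spanned by the unit-normalized vectors."""
    unit = [normalize(v) for v in vectors]
    return math.sqrt(max(determinant(gram_matrix(unit)), 0.0))


orthogonal = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
aligned = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 2.0, 3.0]]
print(gramian_volume(orthogonal), gramian_volume(aligned))
```

Mutually orthogonal embeddings give the largest volume and perfectly aligned ones give zero, which is why minimizing the volume pulls all modalities together at once rather than pair by pair.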

arXiv information

Authors	Giordano Cicchetti, Eleonora Grassucci, Luigi Sigillo, Danilo Comminiello
Published	2025-02-12 13:25:10+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Category: cs.AI, cs.CV, cs.LG

A Survey on Video Analytics in Cloud-Edge-Terminal Collaborative Systems

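Posted: February 13, 2025 | Author: jarxiv

Summary

The explosive growth of video data has driven the development of distributed video analytics in cloud-edge-terminal collaborative (CETC) systems, enabling efficient video processing, real-time inference, and privacy-preserving analysis.
Among their advantages, CETC systems can distribute video processing tasks and enable adaptive analytics across cloud, edge, and terminal devices, leading to breakthroughs in video surveillance, autonomous driving, and smart cities.
This survey first analyzes the fundamental architectural components, including hierarchical, distributed, and hybrid frameworks, alongside edge computing platforms and resource management mechanisms.
Building on these foundations, edge-centric approaches emphasize on-device processing, edge-assisted offloading, and edge intelligence, while cloud-centric methods leverage powerful computational capabilities for complex video understanding and model training.
The survey also covers hybrid video analytics that incorporate adaptive task offloading and resource-aware scheduling techniques to optimize performance across the entire system.
Beyond conventional approaches, recent advances in large language models and multimodal integration reveal both opportunities and challenges in platform scalability, data protection, and system reliability.
Future directions include explainable systems, efficient processing mechanisms, and advanced video analytics, offering valuable insights for researchers and practitioners in this dynamic field.

Abstract (original)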

The explosive growth of video data has driven the development of distributed video analytics in cloud-edge-terminal collaborative (CETC) systems, enabling efficient video processing, real-time inference, and privacy-preserving analysis. Among multiple advantages, CETC systems can distribute video processing tasks and enable adaptive analytics across cloud, edge, and terminal devices, leading to breakthroughs in video surveillance, autonomous driving, and smart cities. In this survey, we first analyze fundamental architectural components, including hierarchical, distributed, and hybrid frameworks, alongside edge computing platforms and resource management mechanisms. Building upon these foundations, edge-centric approaches emphasize on-device processing, edge-assisted offloading, and edge intelligence, while cloud-centric methods leverage powerful computational capabilities for complex video understanding and model training. Our investigation also covers hybrid video analytics incorporating adaptive task offloading and resource-aware scheduling techniques that optimize performance across the entire system. Beyond conventional approaches, recent advances in large language models and multimodal integration reveal both opportunities and challenges in platform scalability, data protection, and system reliability. Future directions also encompass explainable systems, efficient processing mechanisms, and advanced video analytics, offering valuable insights for researchers and practitioners in this dynamic field.

arXiv information

Authors	Linxiao Gong, Hao Yang, Gaoyun Fang, Bobo Ju, Juncen Guo, Xiaoguang Zhu, Yan Wang, Xiping Hu, Peng Sun, Azzedine Boukerche
Published	2025-02-12 13:25:22+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Category: cs.CV, cs.LG, cs.NI

ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification

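Posted: February 13, 2025 | Author: jarxiv

Summary

Multiple instance learning (MIL)-based frameworks have become the mainstream approach for processing whole slide images (WSIs), which have gigapixel size and hierarchical image context, in digital pathology.
However, these methods depend heavily on a substantial number of bag-level labels and learn only from the original slides, which are easily affected by variations in data distribution.
Recently, vision-language model (VLM)-based methods have introduced a language prior by pre-training on large-scale pathological image-text pairs.
However, earlier text prompts do not take pathological prior knowledge into account and therefore do not substantially boost model performance.
Moreover, collecting such pairs and running the pre-training process are very time-consuming and resource-intensive.
To solve these problems, the authors propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
Specifically, they propose a dual-scale visual descriptive text prompt based on a frozen large language model (LLM) to effectively boost VLM performance.
To transfer the VLM to process WSIs efficiently, for the image branch they propose a prototype-guided patch decoder that aggregates patch features progressively by grouping similar patches into the same prototype.
For the text branch, they introduce a context-guided text decoder that enhances the text features by incorporating multi-granular image contexts.
Extensive studies on three multi-cancer and multi-center subtyping datasets demonstrate the superiority of ViLa-MIL.

Abstract (original)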

Multiple instance learning (MIL)-based framework has become the mainstream for processing the whole slide image (WSI) with giga-pixel size and hierarchical image context in digital pathology. However, these methods heavily depend on a substantial number of bag-level labels and solely learn from the original slides, which are easily affected by variations in data distribution. Recently, vision language model (VLM)-based methods introduced the language prior by pre-training on large-scale pathological image-text pairs. However, the previous text prompt lacks the consideration of pathological prior knowledge, therefore does not substantially boost the model’s performance. Moreover, the collection of such pairs and the pre-training process are very time-consuming and source-intensive. To solve the above problems, we propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification. Specifically, we propose a dual-scale visual descriptive text prompt based on the frozen large language model (LLM) to boost the performance of VLM effectively. To transfer the VLM to process WSI efficiently, for the image branch, we propose a prototype-guided patch decoder to aggregate the patch features progressively by grouping similar patches into the same prototype; for the text branch, we introduce a context-guided text decoder to enhance the text features by incorporating the multi-granular image contexts. Extensive studies on three multi-cancer and multi-center subtyping datasets demonstrate the superiority of ViLa-MIL.
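
The idea of "grouping similar patches into the same prototype" can be sketched as a hard assignment by cosine similarity followed by a per-prototype mean. This is only an illustration: the paper's prototype-guided patch decoder is a learned component whose details are not given in the abstract, and the function names and the hard nearest-prototype rule are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def aggregate_by_prototype(patches, prototypes):
    """Assign each patch to its most similar prototype and average each group.

    Hypothetical rule: hard assignment by cosine similarity. A prototype with
    no assigned patches keeps its own vector.
    """
    groups = [[] for _ in prototypes]
    for patch in patches:
        best = max(range(len(prototypes)), key=lambda i: cosine(patch, prototypes[i]))
        groups[best].append(patch)
    aggregated = []
    for proto, group in zip(prototypes, groups):
        if not group:
            aggregated.append(list(proto))
            continue
        aggregated.append(
            [sum(p[d] for p in group) / len(group) for d in range(len(proto))]
        )
    return aggregated


patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
prototypes = [[1.0, 0.0], [0.0, 1.0]]
print(aggregate_by_prototype(patches, prototypes))
```

Thousands of patch features from one slide would be reduced this way to a handful of prototype-level features, which is what makes slide-level processing tractable.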

arXiv information

Authors	Jiangbo Shi, Chen Li, Tieliang Gong, Yefeng Zheng, Huazhu Fu
Published	2025-02-12 13:28:46+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Category: cs.CV
