
STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving

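Posted: June 9, 2025 | Author: jarxiv

Summary

We introduce STSBench, a scenario-based framework for benchmarking the holistic understanding of vision-language models (VLMs) for autonomous driving.
The framework automatically mines pre-defined traffic scenarios from any dataset using ground-truth annotations, provides an intuitive user interface for efficient human verification, and generates multiple-choice questions for model evaluation.
Applied to the NuScenes dataset, it yields STSnu, the first benchmark that evaluates the spatio-temporal reasoning capabilities of VLMs based on comprehensive 3D perception.
Existing benchmarks typically target off-the-shelf or fine-tuned VLMs on images or videos from a single viewpoint and focus on semantic tasks such as object recognition, dense captioning, risk assessment, and scene understanding.
In contrast, STSnu evaluates driving expert VLMs for end-to-end driving that operate on videos from multi-view cameras or LiDAR.
It specifically assesses their ability to reason about both ego-vehicle actions and complex interactions among traffic participants, a crucial capability for autonomous vehicles.
The benchmark covers 43 diverse scenarios spanning multiple views and frames, resulting in 971 human-verified multiple-choice questions.
A thorough evaluation reveals critical shortcomings in existing models' ability to reason about fundamental traffic dynamics in complex environments.
These findings highlight the urgent need for architectural advances that explicitly model spatio-temporal reasoning.
By addressing a core gap in spatio-temporal evaluation, STSBench enables the development of more robust and explainable VLMs for autonomous driving.

Abstract (original)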

We introduce STSBench, a scenario-based framework to benchmark the holistic understanding of vision-language models (VLMs) for autonomous driving. The framework automatically mines pre-defined traffic scenarios from any dataset using ground-truth annotations, provides an intuitive user interface for efficient human verification, and generates multiple-choice questions for model evaluation. Applied to the NuScenes dataset, we present STSnu, the first benchmark that evaluates the spatio-temporal reasoning capabilities of VLMs based on comprehensive 3D perception. Existing benchmarks typically target off-the-shelf or fine-tuned VLMs for images or videos from a single viewpoint and focus on semantic tasks such as object recognition, dense captioning, risk assessment, or scene understanding. In contrast, STSnu evaluates driving expert VLMs for end-to-end driving, operating on videos from multi-view cameras or LiDAR. It specifically assesses their ability to reason about both ego-vehicle actions and complex interactions among traffic participants, a crucial capability for autonomous vehicles. The benchmark features 43 diverse scenarios spanning multiple views and frames, resulting in 971 human-verified multiple-choice questions. A thorough evaluation uncovers critical shortcomings in existing models’ ability to reason about fundamental traffic dynamics in complex environments. These findings highlight the urgent need for architectural advances that explicitly model spatio-temporal reasoning. By addressing a core gap in spatio-temporal evaluation, STSBench enables the development of more robust and explainable VLMs for autonomous driving.

arXiv information

Authors: Christian Fruhwirth-Reisinger, Dušan Malić, Wei Lin, David Schinagl, Samuel Schulter, Horst Possegger
Published: 2025-06-06 16:25:22+00:00

Source and services used

arxiv.jp, Google

Categories: cs.CV

GenIR: Generative Visual Feedback for Mental Image Retrieval

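Posted: June 9, 2025 | Author: jarxiv

Summary

Vision-language models (VLMs) have shown strong performance on text-to-image retrieval benchmarks.
However, carrying this success over to real-world applications remains a challenge.
In practice, human search behavior is rarely a one-shot action.
Instead, it is often a multi-round process guided by clues in mind, that is, a mental image ranging from a vague recollection to a vivid mental representation of the target image.
Motivated by this gap, we study the task of Mental Image Retrieval (MIR), which targets the realistic yet underexplored setting in which users refine their search for a mentally envisioned image through multi-round interactions with an image search engine.
Central to successful interactive retrieval is the machine's ability to give users clear, actionable feedback.
However, existing methods rely on indirect or abstract verbal feedback, which can be ambiguous, misleading, or ineffective for refining the query.
To overcome this, we propose GenIR, a generative multi-round retrieval paradigm that uses diffusion-based image generation to explicitly reify the AI system's understanding at each round.
These synthetic visual representations provide clear, interpretable feedback, enabling users to refine their queries intuitively and effectively.
We further introduce a fully automated pipeline for generating a high-quality multi-round MIR dataset.
Experimental results show that GenIR significantly outperforms existing interactive methods in the MIR scenario.
This work establishes a new task with a dataset and an effective generative retrieval method, providing a foundation for future research in this direction.

Abstract (original)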

Vision-language models (VLMs) have shown strong performance on text-to-image retrieval benchmarks. However, bridging this success to real-world applications remains a challenge. In practice, human search behavior is rarely a one-shot action. Instead, it is often a multi-round process guided by clues in mind, that is, a mental image ranging from vague recollections to vivid mental representations of the target image. Motivated by this gap, we study the task of Mental Image Retrieval (MIR), which targets the realistic yet underexplored setting where users refine their search for a mentally envisioned image through multi-round interactions with an image search engine. Central to successful interactive retrieval is the capability of machines to provide users with clear, actionable feedback; however, existing methods rely on indirect or abstract verbal feedback, which can be ambiguous, misleading, or ineffective for users to refine the query. To overcome this, we propose GenIR, a generative multi-round retrieval paradigm leveraging diffusion-based image generation to explicitly reify the AI system’s understanding at each round. These synthetic visual representations provide clear, interpretable feedback, enabling users to refine their queries intuitively and effectively. We further introduce a fully automated pipeline to generate a high-quality multi-round MIR dataset. Experimental results demonstrate that GenIR significantly outperforms existing interactive methods in the MIR scenario. This work establishes a new task with a dataset and an effective generative retrieval method, providing a foundation for future research in this direction.

arXiv information

Authors: Diji Yang, Minghao Liu, Chung-Hsiang Lo, Yi Zhang, James Davis
Published: 2025-06-06 16:28:03+00:00

Categories: cs.AI, cs.CV

Fréchet Radiomic Distance (FRD): A Versatile Metric for Comparing Medical Imaging Datasets

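Posted: June 9, 2025 | Author: jarxiv

Summary

Determining whether two sets of images belong to the same or different distributions or domains is a crucial task in modern medical image analysis and deep learning.
One example is evaluating the output quality of image generative models.
Currently, metrics used for this task either rely on the (potentially biased) choice of a downstream task, such as segmentation, or adopt task-independent perceptual metrics from natural imaging (e.g., Fréchet Inception Distance, FID), which we show capture anatomical features insufficiently.
To this end, we introduce FRD (Fréchet Radiomic Distance), a new perceptual metric tailored to medical images that uses standardized, clinically meaningful, and interpretable image features.
We show that FRD is superior to other image distribution metrics across a range of medical imaging applications, including out-of-domain (OOD) detection, the evaluation of image-to-image translation (by correlating more strongly with downstream task performance as well as anatomical consistency and realism), and the evaluation of unconditional image generation.
Moreover, FRD offers additional benefits such as stability and computational efficiency at low sample sizes, sensitivity to image corruptions and adversarial attacks, feature interpretability, and correlation with radiologist-perceived image quality.
Additionally, we address key gaps in the literature by presenting an extensive framework for the multifaceted evaluation of image similarity metrics in medical imaging, including the first large-scale comparative study of generative models for medical image translation, and we release an accessible codebase to facilitate future research.
Our results are supported by thorough experiments spanning a variety of datasets, modalities, and downstream tasks, highlighting the broad potential of FRD for medical image analysis.

Abstract (original)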

Determining whether two sets of images belong to the same or different distributions or domains is a crucial task in modern medical image analysis and deep learning; for example, to evaluate the output quality of image generative models. Currently, metrics used for this task either rely on the (potentially biased) choice of some downstream task, such as segmentation, or adopt task-independent perceptual metrics (e.g., Fréchet Inception Distance/FID) from natural imaging, which we show insufficiently capture anatomical features. To this end, we introduce a new perceptual metric tailored for medical images, FRD (Fréchet Radiomic Distance), which utilizes standardized, clinically meaningful, and interpretable image features. We show that FRD is superior to other image distribution metrics for a range of medical imaging applications, including out-of-domain (OOD) detection, the evaluation of image-to-image translation (by correlating more with downstream task performance as well as anatomical consistency and realism), and the evaluation of unconditional image generation. Moreover, FRD offers additional benefits such as stability and computational efficiency at low sample sizes, sensitivity to image corruptions and adversarial attacks, feature interpretability, and correlation with radiologist-perceived image quality. Additionally, we address key gaps in the literature by presenting an extensive framework for the multifaceted evaluation of image similarity metrics in medical imaging, including the first large-scale comparative study of generative models for medical image translation, and release an accessible codebase to facilitate future research. Our results are supported by thorough experiments spanning a variety of datasets, modalities, and downstream tasks, highlighting the broad potential of FRD for medical image analysis.
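
For orientation, the Fréchet distance used by FID compares two feature distributions through their Gaussian moments. The abstract does not give FRD's exact formula, so the expression below is only a sketch under the assumption that FRD applies the same distance to radiomic feature vectors. The symbols (means $\mu_1, \mu_2$ and covariances $\Sigma_1, \Sigma_2$ of the two feature sets) are notation introduced here, not taken from the paper.

```latex
d_F^2\big(\mathcal{N}(\mu_1,\Sigma_1),\,\mathcal{N}(\mu_2,\Sigma_2)\big)
  = \lVert \mu_1 - \mu_2 \rVert_2^2
  + \operatorname{Tr}\!\big(\Sigma_1 + \Sigma_2 - 2\,(\Sigma_1 \Sigma_2)^{1/2}\big)
```

The distance is zero when both feature sets have identical mean and covariance, and it grows as either moment diverges.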

arXiv information

Authors: Nicholas Konz, Richard Osuala, Preeti Verma, Yuwen Chen, Hanxue Gu, Haoyu Dong, Yaqian Chen, Andrew Marshall, Lidia Garrucho, Kaisar Kushibar, Daniel M. Lang, Gene S. Kim, Lars J. Grimm, John M. Lewin, James S. Duncan, Julia A. Schnabel, Oliver Diaz, Karim Lekadir, Maciej A. Mazurowski
Published: 2025-06-06 16:36:34+00:00

Categories: cs.CV, cs.LG, eess.IV, stat.ML

A novel non-convex minimax $p$-th order concave penalty function approach to low-rank tensor completion

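Posted: June 9, 2025 | Author: jarxiv

Summary

The low-rank tensor completion (LRTC) problem aims to reconstruct a tensor from partial sample information and has attracted significant interest in a wide range of practical applications such as image processing and computer vision.
Among the techniques employed for LRTC, non-convex relaxation methods have been widely studied for their effectiveness in handling tensor singular values, which are crucial for accurate tensor recovery.
The minimax concave penalty (MCP) non-convex relaxation method has achieved promising results on LRTC and has been widely adopted, but it has a notable limitation: it penalizes small singular values insufficiently during singular value handling, which makes tensor recovery inefficient.
To address this issue and improve recovery performance, a novel minimax $p$-th order concave penalty (MPCP) function is proposed.
Based on this function, a tensor $p$-th order $\tau$ norm is proposed as a non-convex relaxation for tensor rank approximation, thereby establishing an MPCP-based LRTC model.
Furthermore, theoretical convergence guarantees are rigorously established for the proposed method.
Extensive numerical experiments on multiple real datasets show that the proposed method outperforms state-of-the-art methods in both visual quality and quantitative metrics.

Abstract (original)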

The low-rank tensor completion (LRTC) problem aims to reconstruct a tensor from partial sample information, which has attracted significant interest in a wide range of practical applications such as image processing and computer vision. Among the various techniques employed for the LRTC problem, non-convex relaxation methods have been widely studied for their effectiveness in handling tensor singular values, which are crucial for accurate tensor recovery. While the minimax concave penalty (MCP) non-convex relaxation method has achieved promising results in tackling the LRTC problem and has been widely adopted, it exhibits a notable limitation: insufficient penalty on small singular values during the singular value handling process, resulting in inefficient tensor recovery. To address this issue and enhance recovery performance, a novel minimax $p$-th order concave penalty (MPCP) function is proposed. Based on this novel function, a tensor $p$-th order $\tau$ norm is proposed as a non-convex relaxation for tensor rank approximation, thereby establishing an MPCP-based LRTC model. Furthermore, theoretical convergence guarantees are rigorously established for the proposed method. Extensive numerical experiments conducted on multiple real datasets demonstrate that the proposed method outperforms the state-of-the-art methods in both visual quality and quantitative metrics.
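
The abstract names the minimax concave penalty (MCP) without defining it. As a point of reference only, the sketch below implements the standard scalar MCP from the sparse-regression literature and sums it over a list of singular values. The function names and the default values of `lam` and `gamma` are illustrative assumptions. The proposed MPCP function is not reproduced, because the abstract does not give its form.

```python
def mcp_penalty(t: float, lam: float = 1.0, gamma: float = 2.0) -> float:
    """Standard minimax concave penalty (MCP) of a scalar t.

    Piecewise definition (lam > 0, gamma > 0):
        lam*|t| - t^2 / (2*gamma)   if |t| <= gamma*lam
        gamma*lam^2 / 2             otherwise
    """
    if lam <= 0 or gamma <= 0:
        raise ValueError("lam and gamma must be positive")
    a = abs(t)
    if a <= gamma * lam:
        # Concave region: the penalty grows, but more slowly than lam*|t|.
        return lam * a - a * a / (2.0 * gamma)
    # Flat region: large values incur a constant penalty.
    return 0.5 * gamma * lam * lam


def mcp_of_singular_values(singular_values, lam: float = 1.0, gamma: float = 2.0) -> float:
    """Sum the scalar MCP over singular values (illustrative helper)."""
    return sum(mcp_penalty(s, lam, gamma) for s in singular_values)
```

With `lam=1` and `gamma=2`, the penalty stops growing once a singular value reaches 2, so large singular values are not shrunk further than moderate ones.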

arXiv information

Authors: Hongbing Zhang, Bing Zheng
Published: 2025-06-06 16:43:00+00:00

Categories: cs.CV

Towards an Explainable Comparison and Alignment of Feature Embeddings

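Posted: June 9, 2025 | Author: jarxiv

Summary

While several feature embedding models have been developed in the literature, comparisons of these embeddings have largely focused on their numerical performance in classification-related downstream applications.
However, an interpretable comparison of different embeddings requires identifying and analyzing mismatches between the sample groups clustered within the embedding spaces.
In this work, we propose the Spectral Pairwise Embedding Comparison (SPEC) framework to compare embeddings and identify their differences in clustering a reference dataset.
Our approach examines the kernel matrices derived from the two embeddings and uses the eigendecomposition of the difference kernel matrix to detect sample clusters that the two embeddings capture differently.
We present a scalable implementation of this kernel-based approach whose computational complexity grows linearly with the sample size.
Furthermore, we use this framework to formulate an optimization problem that aligns two embeddings, ensuring that clusters identified in one embedding are also captured in the other model.
We provide numerical results demonstrating the application of SPEC to comparing and aligning embeddings on large-scale datasets such as ImageNet and MS-COCO.
The code is available at https://github.com/mjalali/embedding-comparison.

Abstract (original)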

While several feature embedding models have been developed in the literature, comparisons of these embeddings have largely focused on their numerical performance in classification-related downstream applications. However, an interpretable comparison of different embeddings requires identifying and analyzing mismatches between sample groups clustered within the embedding spaces. In this work, we propose the Spectral Pairwise Embedding Comparison (SPEC) framework to compare embeddings and identify their differences in clustering a reference dataset. Our approach examines the kernel matrices derived from two embeddings and leverages the eigendecomposition of the difference kernel matrix to detect sample clusters that are captured differently by the two embeddings. We present a scalable implementation of this kernel-based approach, with computational complexity that grows linearly with the sample size. Furthermore, we introduce an optimization problem using this framework to align two embeddings, ensuring that clusters identified in one embedding are also captured in the other model. We provide numerical results demonstrating the application of SPEC to compare and align embeddings on large-scale datasets such as ImageNet and MS-COCO. The code is available at https://github.com/mjalali/embedding-comparison.

arXiv information

Authors: Mohammad Jalali, Bahar Dibaei Nia, Farzan Farnia
Published: 2025-06-06 16:50:37+00:00

Categories: cs.AI, cs.CV, cs.LG, math.SP

Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study

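Posted: June 9, 2025 | Author: jarxiv

Summary

While traditional computer vision models have historically struggled to generalize to endoscopic domains, the emergence of foundation models has shown promising cross-domain performance.
In this work, we present the first large-scale study assessing the capabilities of vision-language models (VLMs) on endoscopic tasks, with a specific focus on laparoscopic surgery.
Using a diverse set of state-of-the-art models, multiple surgical datasets, and extensive human reference annotations, we address three key research questions: (1) Can current VLMs solve basic perception tasks on surgical images?
(2) Can they handle advanced frame-based endoscopic scene understanding tasks?
(3) How do specialized medical VLMs compare to generalist models in this context?
Our results show that VLMs can effectively perform basic surgical perception tasks, such as object counting and localization, at performance levels comparable to general-domain tasks.
However, their performance deteriorates significantly when the tasks require medical knowledge.
Notably, we find that specialized medical VLMs currently underperform generalist models on both basic and advanced surgical tasks, suggesting that they are not yet optimized for the complexity of surgical environments.
These findings highlight the need for further advances that enable VLMs to handle the unique challenges posed by surgery.
Overall, our work provides important insights for the development of next-generation endoscopic AI systems and identifies key areas for improvement in medical vision-language models.

Abstract (original)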

While traditional computer vision models have historically struggled to generalize to endoscopic domains, the emergence of foundation models has shown promising cross-domain performance. In this work, we present the first large-scale study assessing the capabilities of Vision Language Models (VLMs) for endoscopic tasks with a specific focus on laparoscopic surgery. Using a diverse set of state-of-the-art models, multiple surgical datasets, and extensive human reference annotations, we address three key research questions: (1) Can current VLMs solve basic perception tasks on surgical images? (2) Can they handle advanced frame-based endoscopic scene understanding tasks? and (3) How do specialized medical VLMs compare to generalist models in this context? Our results reveal that VLMs can effectively perform basic surgical perception tasks, such as object counting and localization, with performance levels comparable to general domain tasks. However, their performance deteriorates significantly when the tasks require medical knowledge. Notably, we find that specialized medical VLMs currently underperform compared to generalist models across both basic and advanced surgical tasks, suggesting that they are not yet optimized for the complexity of surgical environments. These findings highlight the need for further advancements to enable VLMs to handle the unique challenges posed by surgery. Overall, our work provides important insights for the development of next-generation endoscopic AI systems and identifies key areas for improvement in medical visual language models.

arXiv information

Authors: Leon Mayer, Tim Rädsch, Dominik Michael, Lucas Luttner, Amine Yamlahi, Evangelia Christodoulou, Patrick Godau, Marcel Knopp, Annika Reinke, Fiona Kolbinger, Lena Maier-Hein
Published: 2025-06-06 16:53:12+00:00

Categories: cs.CV

Optimizing Cloud-to-GPU Throughput for Deep Learning With Earth Observation Data

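Posted: June 9, 2025 | Author: jarxiv

Summary

Training deep learning models on petabyte-scale Earth observation (EO) data requires separating compute resources from data storage.
However, standard PyTorch data loaders cannot keep modern GPUs utilized when streaming GeoTIFF files directly from cloud storage.
In this work, we benchmark GeoTIFF loading throughput from both cloud object storage and local SSD, systematically testing different loader configurations and data parameters.
We focus on tile-aligned reads and worker thread pools, using Bayesian optimization to find the optimal settings for each storage type.
Our optimized configurations increase remote data loading throughput by 20x and local throughput by 4x compared to default settings.
On three public EO benchmarks, models trained with optimized remote loading achieve the same accuracy as local training within identical time budgets.
We improve validation IoU by 6-15% and maintain 85-95% GPU utilization, versus 0-30% with standard configurations.
Code is publicly available at https://github.com/microsoft/pytorch-cloud-geotiff-optimization.

Abstract (original)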

Training deep learning models on petabyte-scale Earth observation (EO) data requires separating compute resources from data storage. However, standard PyTorch data loaders cannot keep modern GPUs utilized when streaming GeoTIFF files directly from cloud storage. In this work, we benchmark GeoTIFF loading throughput from both cloud object storage and local SSD, systematically testing different loader configurations and data parameters. We focus on tile-aligned reads and worker thread pools, using Bayesian optimization to find optimal settings for each storage type. Our optimized configurations increase remote data loading throughput by 20x and local throughput by 4x compared to default settings. On three public EO benchmarks, models trained with optimized remote loading achieve the same accuracy as local training within identical time budgets. We improve validation IoU by 6-15% and maintain 85-95% GPU utilization versus 0-30% with standard configurations. Code is publicly available at https://github.com/microsoft/pytorch-cloud-geotiff-optimization
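
The two levers the abstract names, tile-aligned reads and worker thread pools, can be illustrated with a small standard-library sketch. This is not the authors' implementation. `tile_aligned_window`, `load_windows`, and the `read_window` callable are hypothetical names, and a real loader would pass a function that reads a window from a GeoTIFF instead of the placeholder used here.

```python
from concurrent.futures import ThreadPoolExecutor


def tile_aligned_window(col_off, row_off, width, height, tile_size):
    """Snap a pixel window outward to the file's internal tile grid.

    Returns (col_off, row_off, width, height) of the smallest window that
    covers the request and starts and ends on tile boundaries, so a read
    fetches whole tiles rather than partial ones.
    """
    c0 = (col_off // tile_size) * tile_size                 # round start down
    r0 = (row_off // tile_size) * tile_size
    c1 = -(-(col_off + width) // tile_size) * tile_size     # round end up
    r1 = -(-(row_off + height) // tile_size) * tile_size
    return c0, r0, c1 - c0, r1 - r0


def load_windows(read_window, windows, num_threads=8):
    """Run read_window over all windows with a pool of worker threads.

    Threads suit I/O-bound remote reads; results keep the input order.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(read_window, windows))
```

For example, with 512-pixel tiles a 256x256 request at column 300, row 700 snaps to the window `(0, 512, 1024, 512)`. The tile size and thread count are the kind of parameters the abstract describes tuning per storage type.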

arXiv information

Authors: Akram Zaytar, Caleb Robinson, Girmaw Abebe Tadesse, Tammy Glazer, Gilles Hacheme, Anthony Ortiz, Rahul M Dodhia, Juan M Lavista Ferres
Published: 2025-06-06 16:54:13+00:00

Categories: cs.CV

A Lightweight Dual-Branch System for Weakly-Supervised Video Anomaly Detection on Consumer Edge Devices

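Posted: June 9, 2025 | Author: jarxiv

Summary

The growing demand for intelligent security in consumer electronics, such as smart home cameras and personal monitoring systems, is often hindered by the high computational cost and large model sizes of advanced AI.
These limitations prevent the effective deployment of real-time Video Anomaly Detection (VAD) on resource-constrained edge devices.
To bridge this gap, this paper introduces Rule-based Video Anomaly Detection (RuleVAD), a novel, lightweight system engineered for high-efficiency, low-complexity threat detection directly on consumer hardware.
RuleVAD features a decoupled dual-branch architecture that minimizes computational load.
An implicit branch uses visual features for rapid, coarse-grained binary classification, efficiently filtering out normal activity to avoid unnecessary processing.
For potentially anomalous or complex events, a multimodal explicit branch takes over.
This branch uses YOLO-World to detect objects and applies data mining to generate interpretable, text-based association rules from the scene.
By aligning these rules with visual data, RuleVAD achieves a more nuanced, fine-grained classification, significantly reducing the false alarms common in vision-only systems.
Extensive experiments on the XD-Violence and UCF-Crime benchmark datasets show that RuleVAD achieves superior performance, surpassing existing state-of-the-art methods in both accuracy and speed.
Crucially, the entire system is optimized for low-power operation and is fully deployable on an NVIDIA Jetson Nano board, demonstrating its practical feasibility for bringing advanced, real-time security monitoring to everyday consumer electronic devices.

Abstract (original)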

The growing demand for intelligent security in consumer electronics, such as smart home cameras and personal monitoring systems, is often hindered by the high computational cost and large model sizes of advanced AI. These limitations prevent the effective deployment of real-time Video Anomaly Detection (VAD) on resource-constrained edge devices. To bridge this gap, this paper introduces Rule-based Video Anomaly Detection (RuleVAD), a novel, lightweight system engineered for high-efficiency and low-complexity threat detection directly on consumer hardware. RuleVAD features an innovative decoupled dual-branch architecture to minimize computational load. An implicit branch uses visual features for rapid, coarse-grained binary classification, efficiently filtering out normal activity to avoid unnecessary processing. For potentially anomalous or complex events, a multimodal explicit branch takes over. This branch leverages YOLO-World to detect objects and applies data mining to generate interpretable, text-based association rules from the scene. By aligning these rules with visual data, RuleVAD achieves a more nuanced, fine-grained classification, significantly reducing the false alarms common in vision-only systems. Extensive experiments on the XD-Violence and UCF-Crime benchmark datasets show that RuleVAD achieves superior performance, surpassing existing state-of-the-art methods in both accuracy and speed. Crucially, the entire system is optimized for low-power operation and is fully deployable on an NVIDIA Jetson Nano board, demonstrating its practical feasibility for bringing advanced, real-time security monitoring to everyday consumer electronic devices.

arXiv information

Authors: Wen-Dong Jiang, Chih-Yung Chang, Ssu-Chi Kuai, Diptendu Sinha Roy
Published: 2025-06-06 17:04:26+00:00

Categories: cs.AI, cs.CV

Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models

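Posted: June 9, 2025 | Author: jarxiv

Summary

Recent advances in multimodal large language models have driven breakthroughs in visual question answering.
Yet a critical gap persists: "conceptualization", the ability to recognize and reason about the same concept despite variations in visual form, which is a basic ability of human reasoning.
To address this challenge, we introduce the Visual Graph Arena (VGA), a dataset of six graph-based tasks designed to evaluate and improve AI systems' capacity for visual abstraction.
VGA uses diverse graph layouts (e.g., Kamada-Kawai vs. planar) to test reasoning independent of visual form.
Experiments with state-of-the-art vision models and multimodal LLMs reveal a striking divide: humans achieved near-perfect accuracy across tasks, while models failed completely on isomorphism detection and showed limited success on path/cycle tasks.
We further identify behavioral anomalies that suggest pseudo-intelligent pattern matching rather than genuine understanding.
These findings underscore fundamental limitations of current AI models for visual understanding.
By isolating the challenge of representation-invariant reasoning, VGA provides a framework to drive progress toward human-like conceptualization in AI visual models.
The Visual Graph Arena is available at https://vga.csail.mit.edu/.

Abstract (original)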

Recent advancements in multimodal large language models have driven breakthroughs in visual question answering. Yet, a critical gap persists: "conceptualization", the ability to recognize and reason about the same concept despite variations in visual form, a basic ability of human reasoning. To address this challenge, we introduce the Visual Graph Arena (VGA), a dataset featuring six graph-based tasks designed to evaluate and improve AI systems' capacity for visual abstraction. VGA uses diverse graph layouts (e.g., Kamada-Kawai vs. planar) to test reasoning independent of visual form. Experiments with state-of-the-art vision models and multimodal LLMs reveal a striking divide: humans achieved near-perfect accuracy across tasks, while models totally failed on isomorphism detection and showed limited success in path/cycle tasks. We further identify behavioral anomalies suggesting pseudo-intelligent pattern matching rather than genuine understanding. These findings underscore fundamental limitations in current AI models for visual understanding. By isolating the challenge of representation-invariant reasoning, the VGA provides a framework to drive progress toward human-like conceptualization in AI visual models. The Visual Graph Arena is available at https://vga.csail.mit.edu/.
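
To make the isomorphism task concrete: whether two graphs are isomorphic depends only on their node and edge structure, not on where a layout algorithm draws the nodes. The brute-force check below is a minimal sketch for small graphs. It is not part of VGA, and the function name and input format are assumptions made for illustration.

```python
from itertools import permutations


def are_isomorphic(nodes_a, edges_a, nodes_b, edges_b):
    """Brute-force isomorphism test for small undirected graphs.

    Tries every relabelling of graph A's nodes onto graph B's nodes and
    checks whether the edge sets coincide. Drawing coordinates are never
    consulted, so the answer is independent of the visual layout.
    """
    ea = {frozenset(e) for e in edges_a}
    eb = {frozenset(e) for e in edges_b}
    nodes_a, nodes_b = list(nodes_a), list(nodes_b)
    if len(nodes_a) != len(nodes_b) or len(ea) != len(eb):
        return False
    for perm in permutations(nodes_b):
        mapping = dict(zip(nodes_a, perm))
        # Relabel every edge of A and compare with B's edge set.
        if {frozenset(mapping[u] for u in e) for e in ea} == eb:
            return True
    return False
```

The search is factorial in the number of nodes, so it only suits graphs small enough to draw legibly.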

arXiv information

Authors: Zahra Babaiee, Peyman M. Kiasari, Daniela Rus, Radu Grosu
Published: 2025-06-06 17:06:25+00:00

Categories: cs.AI, cs.CV

LlavaGuard: An Open VLM-based Framework for Safeguarding Vision Datasets and Models

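Posted: June 9, 2025 | Author: jarxiv

Summary

This paper introduces LlavaGuard, a suite of VLM-based vision safeguards that address the critical need for reliable guardrails in the era of large-scale data and models.
To this end, we establish a novel open framework that describes a customizable safety taxonomy, data preprocessing, augmentation, and the training setup.
To teach a VLM safeguard about safety, we create a multimodal safety dataset with high-quality human expert annotations, in which each image is labeled with a safety rating, category, and rationale.
We also employ advanced augmentations to support context-specific assessments.
The resulting LlavaGuard models, ranging from 0.5B to 7B parameters, serve as a versatile tool for evaluating the safety compliance of visual content against flexible policies.
In comprehensive experiments, LlavaGuard outperforms both state-of-the-art safeguards and VLMs in accuracy and in flexibly handling different policies.
Additionally, we demonstrate LlavaGuard's performance in two real-world applications: large-scale dataset annotation and moderation of text-to-image models.
We make our entire framework, including the dataset, model weights, and training code, publicly available.

Abstract (original)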

This paper introduces LlavaGuard, a suite of VLM-based vision safeguards that address the critical need for reliable guardrails in the era of large-scale data and models. To this end, we establish a novel open framework, describing a customizable safety taxonomy, data preprocessing, augmentation, and training setup. For teaching a VLM safeguard on safety, we further create a multimodal safety dataset with high-quality human expert annotations, where each image is labeled with a safety rating, category, and rationale. We also employ advanced augmentations to support context-specific assessments. The resulting LlavaGuard models, ranging from 0.5B to 7B, serve as a versatile tool for evaluating the safety compliance of visual content against flexible policies. In comprehensive experiments, LlavaGuard outperforms both state-of-the-art safeguards and VLMs in accuracy and in flexibly handling different policies. Additionally, we demonstrate LlavaGuard's performance in two real-world applications: large-scale dataset annotation and moderation of text-to-image models. We make our entire framework, including the dataset, model weights, and training code, publicly available.

arXiv information

Authors: Lukas Helff, Felix Friedrich, Manuel Brack, Kristian Kersting, Patrick Schramowski
Published: 2025-06-06 17:08:12+00:00

Categories: cs.AI, cs.CV, cs.LG
