"cs.CV" Category Archive

Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines

Posted: October 29, 2024 | Author: jarxiv

Abstract: Search engines make it possible to look up unknown information using text. However, …

Categories: cs.AI, cs.CV, cs.IR, cs.LG

Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective

Posted: October 29, 2024 | Author: jarxiv

Abstract: Focusing on concept granularity, the image-text retrieval (ITR) evaluation pipeline's …

Categories: cs.AI, cs.CV, cs.IR

Multi-modal AI for comprehensive breast cancer prognostication

Posted: October 29, 2024 | Author: jarxiv

Abstract: Treatment selection in breast cancer is determined by molecular subtype and clinical characteristics. Recurrence …

Categories: cs.AI, cs.CV, eess.IV

AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?

Posted: October 29, 2024 | Author: jarxiv

Abstract: Large Vision-Language Models (LVLMs) …

Categories: cs.AI, cs.CV

LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior

Posted: October 29, 2024 | Author: jarxiv

Abstract: Overcoming the limitations of current video tokenization methods for autoregressive (AR) generative models …

Categories: cs.AI, cs.CV

OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup

Posted: October 29, 2024 | Author: jarxiv

Abstract: In recent years, scaling up has brought great success in the fields of vision and language. However, …

Categories: cs.CV, cs.MM, cs.SD, eess.AS

On Inductive Biases That Enable Generalization of Diffusion Transformers

Posted: October 29, 2024 | Author: jarxiv

Abstract: Recent work studying the generalization of diffusion models with UNet-based denoisers …

Categories: cs.CV

Enhancing Action Recognition by Leveraging the Hierarchical Structure of Actions and Textual Context

Posted: October 29, 2024 | Author: jarxiv

Abstract: The sequential execution of actions and their hierarchical structure, composed of different levels of abstraction, …

Categories: cs.AI, cs.CV

x-RAGE: eXtended Reality — Action & Gesture Events Dataset

Posted: October 29, 2024 | Author: jarxiv

Abstract: With the emergence of the metaverse and the recent focus on wearable devices, gesture …

Categories: cs.CV, cs.ET

NeuroClips: Towards High-fidelity and Smooth fMRI-to-Video Reconstruction

Posted: October 29, 2024 | Author: jarxiv

Abstract: Reconstruction of static visual stimuli from non-invasive brain activity fMRI … CLIP and Stable Diffusion …

Categories: cs.AI, cs.CV, eess.IV
