"cs.LG" Category Archive

ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models

Posted on June 12, 2025 by jarxiv

Abstract: Reasoning over sequences of images, for multimodal large language models (MLLMs), …

Categories: cs.CL, cs.CV, cs.LG

Canonical Latent Representations in Conditional Diffusion Models

Posted on June 12, 2025 by jarxiv

Abstract: Conditional diffusion models (CDMs), across a variety of generative tasks, impressive performance …

Categories: cs.CV, cs.LG

Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy

Posted on June 12, 2025 by jarxiv

Abstract: Medical visual question answering (MedVQA), in developing clinical decision support systems, …

Categories: 68T45, 92C55, cs.CV, cs.LG, I.2.10

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Posted on June 12, 2025 by jarxiv

Abstract: A major challenge for modern AI is learning to understand the world and to act largely by observation …

Categories: cs.AI, cs.CV, cs.LG, cs.RO

A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs

Posted on June 12, 2025 by jarxiv

Abstract: Existing benchmarks for evaluating the spatio-temporal understanding and reasoning abilities of video language models …

Categories: cs.CV, cs.LG

EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits

Posted on June 12, 2025 by jarxiv

Abstract: Text-guided image editing, fueled by recent advances in generative AI, is becoming increasingly widespread …

Categories: cs.AI, cs.CV, cs.LG

Spectral Image Tokenizer

Posted on June 12, 2025 by jarxiv

Abstract: Image tokenizers map images to sequences of discrete tokens, and autoregressive …

Categories: cs.CV, cs.LG

Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation

Posted on June 12, 2025 by jarxiv

Abstract: Built on trajectory autoregressive modeling, a novel visuo-motor …

Categories: cs.CV, cs.LG, cs.RO

Text-Aware Image Restoration with Diffusion Models

Posted on June 12, 2025 by jarxiv

Abstract: Image restoration aims to recover degraded images. However, existing …

Categories: cs.AI, cs.CV, cs.LG

DGS-LRM: Real-Time Deformable 3D Gaussian Reconstruction From Monocular Videos

Posted on June 12, 2025 by jarxiv

Abstract: We introduce the Deformable Gaussian Splats Large Reconstruction Model (DGS-LRM) …

Categories: cs.AI, cs.CV, cs.GR, cs.LG
