Category archive: cs.CV

Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs

Posted: June 6, 2025 | Author: jarxiv

Abstract: Remarkable progress in 2D vision-language models (VLMs) has, in 3D question answering, dense …

Category: cs.CV

Revisiting Depth Representations for Feed-Forward 3D Gaussian Splatting

Posted: June 6, 2025 | Author: jarxiv

Abstract: Depth maps, in feed-forward 3D Gaussian Splatting (3DGS) pipeline …

Category: cs.CV

AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

Posted: June 6, 2025 | Author: jarxiv

Abstract: Despite progress in video understanding, current MLLMs struggle with counting tasks …

Category: cs.CV

MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

Posted: June 6, 2025 | Author: jarxiv

Abstract: Chain-of-Thought (CoT) in large language models (LLMs) …

Category: cs.CV

Unleashing Hour-Scale Video Training for Long Video-Language Understanding

Posted: June 6, 2025 | Author: jarxiv

Abstract: Recent long-form video-language understanding benchmarks, video large multimodal …

Categories: cs.CL, cs.CV

VideoMolmo: Spatio-Temporal Grounding Meets Pointing

Posted: June 6, 2025 | Author: jarxiv

Abstract: Spatio-temporal localization, from biological research to autonomous navigation and interactive …

Category: cs.CV

Defurnishing with X-Ray Vision: Joint Removal of Furniture from Panoramas and Mesh

Posted: June 6, 2025 | Author: jarxiv

Abstract: An indoor space represented as a textured mesh and corresponding multi-view panoramic images …

Category: cs.CV

Direct Numerical Layout Generation for 3D Indoor Scene Synthesis via Spatial Reasoning

Posted: June 6, 2025 | Author: jarxiv

Abstract: For embodied AI and digital content creation, realistic 3D indoor scene …

Categories: cs.AI, cs.CV

Refer to Anything with Vision-Language Prompts

Posted: June 6, 2025 | Author: jarxiv

Abstract: Recent image segmentation models segment images into high-quality masks of visual entities …

Categories: cs.AI, cs.CV

ContentV: Efficient Training of Video Generation Models with Limited Compute

Posted: June 6, 2025 | Author: jarxiv

Abstract: To mitigate escalating computational costs, recent advances in video generation increasingly …

Category: cs.CV
