Archive of posts by "jarxiv"

Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos

Posted: June 6, 2025 | Author: jarxiv

Abstract: For comprehensive region-level visual understanding of images and videos, a conceptually simple and efficient …

Categories: cs.CV

ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL

Posted: June 6, 2025 | Author: jarxiv

Abstract: Chain-of-thought reasoning and reinforcement learning (RL) have driven breakthroughs in NLP, but …

Categories: cs.CL, cs.CV

Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels

Posted: June 6, 2025 | Author: jarxiv

Abstract: Finding correspondences between semantically similar points across images and object instances …

Categories: cs.CV

MARBLE: Material Recomposition and Blending in CLIP-Space

Posted: June 6, 2025 | Author: jarxiv

Abstract: Editing the materials of objects in an image based on an exemplar image is, in computer vision, …

Categories: cs.CV

ProJo4D: Progressive Joint Optimization for Sparse-View Inverse Physics Estimation

Posted: June 6, 2025 | Author: jarxiv

Abstract: Neural rendering has made significant progress in 3D reconstruction and novel view synthesis …

Categories: cs.CV

Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs

Posted: June 6, 2025 | Author: jarxiv

Abstract: The remarkable progress of 2D vision-language models (VLMs) has … 3D question answering, dense …

Categories: cs.CV

Revisiting Depth Representations for Feed-Forward 3D Gaussian Splatting

Posted: June 6, 2025 | Author: jarxiv

Abstract: In feed-forward 3D Gaussian Splatting (3DGS) pipelines, depth maps …

Categories: cs.CV

AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

Posted: June 6, 2025 | Author: jarxiv

Abstract: Despite progress in video understanding, current MLLMs struggle with counting tasks …

Categories: cs.CV

MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

Posted: June 6, 2025 | Author: jarxiv

Abstract: Chain-of-Thought (CoT) … in large language models (LLMs) …

Categories: cs.CV

Unleashing Hour-Scale Video Training for Long Video-Language Understanding

Posted: June 6, 2025 | Author: jarxiv

Abstract: Recent long-form video-language understanding benchmarks have … video large multimodal …

Categories: cs.CL, cs.CV
