Category archive: "cs.CV"

Towards Interpreting Visual Information Processing in Vision-Language Models

Posted: October 10, 2024 | Author: jarxiv

Summary: For processing and understanding text and images, vision-language models (VLMs) are …

Categories: cs.CV, cs.LG

CHASE: Learning Convex Hull Adaptive Shift for Skeleton-based Multi-Entity Action Recognition

Posted: October 10, 2024 | Author: jarxiv

Summary: Skeleton-based multi-entity action recognition, multiple diverse …

Categories: cs.CV, cs.LG

Trans4D: Realistic Geometry-Aware Transition for Compositional Text-to-4D Synthesis

Posted: October 10, 2024 | Author: jarxiv

Summary: Recent advances in diffusion models have demonstrated excellent capabilities in image and video generation …

Categories: cs.CV

InstructG2I: Synthesizing Images from Multimodal Attributed Graphs

Posted: October 10, 2024 | Author: jarxiv

Summary: In this paper, an overlooked yet important task, Graph2Image, that is …

Categories: cs.AI, cs.CL, cs.CV, cs.LG, cs.SI

AvatarGO: Zero-shot 4D Human-Object Interaction Generation and Animation

Posted: October 10, 2024 | Author: jarxiv

Summary: With recent advances in diffusion models, 4D full-body human-object interaction …

Categories: cs.CV

Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate

Posted: October 10, 2024 | Author: jarxiv

Summary: The multimodal pre-training of large vision-language models (LVLMs) …

Categories: cs.CL, cs.CV

IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation

Posted: October 10, 2024 | Author: jarxiv

Summary: Advanced diffusion models such as RPG, Stable Diffusion 3, and FLUX …

Categories: cs.CV

Do better language models have crisper vision?

Posted: October 10, 2024 | Author: jarxiv

Summary: To what extent do text-only large language models (LLMs) understand the visual world …

Categories: cs.AI, cs.CL, cs.CV

MM-Ego: Towards Building Egocentric Multimodal LLMs

Posted: October 10, 2024 | Author: jarxiv

Summary: In this research, building a multimodal foundation model for egocentric video understanding …

Categories: cs.AI, cs.CV, cs.LG

Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training

Posted: October 10, 2024 | Author: jarxiv

Summary: To learn a generalist embodied agent that can complete multiple tasks, …

Categories: cs.CV, cs.LG, cs.RO
