Towards Language-Driven Video Inpainting via Multimodal Large Language Models

Abstract

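We introduce a new task, language-driven video inpainting, which uses natural language instructions to guide the inpainting process.
This approach overcomes the limitations of traditional video inpainting methods that depend on manually labeled binary masks, which are often tedious and labor-intensive to produce.
To support training and evaluation for this task, we present the Remove Objects from Videos by Instructions (ROVI) dataset, containing 5,650 videos and 9,091 inpainting results.
We also propose a novel diffusion-based language-driven video inpainting framework, the first end-to-end baseline for this task, which integrates Multimodal Large Language Models to understand and execute complex language-based inpainting requests effectively.
Our comprehensive results demonstrate the dataset's versatility and the model's effectiveness across various language-instructed inpainting scenarios.
We will make the datasets, code, and models publicly available.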

Abstract (original)

We introduce a new task — language-driven video inpainting, which uses natural language instructions to guide the inpainting process. This approach overcomes the limitations of traditional video inpainting methods that depend on manually labeled binary masks, a process often tedious and labor-intensive. We present the Remove Objects from Videos by Instructions (ROVI) dataset, containing 5,650 videos and 9,091 inpainting results, to support training and evaluation for this task. We also propose a novel diffusion-based language-driven video inpainting framework, the first end-to-end baseline for this task, integrating Multimodal Large Language Models to understand and execute complex language-based inpainting requests effectively. Our comprehensive results showcase the dataset’s versatility and the model’s effectiveness in various language-instructed inpainting scenarios. We will make datasets, code, and models publicly available.

arXiv information

Authors	Jianzong Wu, Xiangtai Li, Chenyang Si, Shangchen Zhou, Jingkang Yang, Jiangning Zhang, Yining Li, Kai Chen, Yunhai Tong, Ziwei Liu, Chen Change Loy
Published	2024-01-18 18:59:13+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
