Domain Adaptive Video Semantic Segmentation via Cross-Domain Moving Object Mixing

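Summary

Networks trained for domain adaptation tend to become biased toward classes that transfer easily. Because ground-truth labels for the target domain are unavailable during training, this bias skews the predictions and causes the network to forget how to predict hard-to-transfer classes. To address this problem, we propose Cross-domain Moving Object Mixing (CMOM), which cuts several objects, including those from hard-to-transfer classes, out of a source-domain video clip and pastes them into a target-domain video clip. Unlike image-level domain adaptation, mixing moving objects from two different videos requires preserving temporal context. We therefore design CMOM to mix across consecutive video frames so that no unrealistic motion is introduced. In addition, we propose Feature Alignment with Temporal Context (FATC) to make target-domain features more discriminative. FATC uses the robust source-domain features, which are trained with ground-truth labels, to learn discriminative target-domain features without supervision, filtering out unreliable predictions by temporal consensus. Extensive experiments demonstrate the effectiveness of the proposed approach. In particular, the model reaches 53.81% mIoU on the VIPER to Cityscapes-Seq benchmark and 56.31% mIoU on the SYNTHIA-Seq to Cityscapes-Seq benchmark, surpassing state-of-the-art methods by a large margin.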

Abstract (original)

The network trained for domain adaptation is prone to bias toward the easy-to-transfer classes. Since the ground truth label on the target domain is unavailable during training, the bias problem leads to skewed predictions, forgetting to predict hard-to-transfer classes. To address this problem, we propose Cross-domain Moving Object Mixing (CMOM) that cuts several objects, including hard-to-transfer classes, in the source domain video clip and pastes them into the target domain video clip. Unlike image-level domain adaptation, the temporal context should be maintained to mix moving objects in two different videos. Therefore, we design CMOM to mix with consecutive video frames, so that unrealistic movements are not occurring. We additionally propose Feature Alignment with Temporal Context (FATC) to enhance target domain feature discriminability. FATC exploits the robust source domain features, which are trained with ground truth labels, to learn discriminative target domain features in an unsupervised manner by filtering unreliable predictions with temporal consensus. We demonstrate the effectiveness of the proposed approaches through extensive experiments. In particular, our model reaches mIoU of 53.81% on VIPER to Cityscapes-Seq benchmark and mIoU of 56.31% on SYNTHIA-Seq to Cityscapes-Seq benchmark, surpassing the state-of-the-art methods by large margins.
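The cut-and-paste mixing and the consensus filter described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `cmom_mix` and `temporal_consensus_mask`, the array layouts, and the way classes are selected are all assumptions made for illustration. The consensus function further assumes that the previous frame's prediction has already been aligned to the current frame; the abstract does not say how that alignment is done.

```python
import numpy as np


def cmom_mix(src_frames, src_labels, tgt_frames, tgt_labels, class_ids):
    """Paste source pixels of the chosen classes into a target clip.

    src_frames, tgt_frames: (T, H, W, 3) arrays of consecutive frames.
    src_labels, tgt_labels: (T, H, W) integer class maps. The target map
        would be pseudo-labels, since the target domain has no ground truth.
    class_ids: classes to cut from the source clip (hypothetical selection).
    """
    # The mask is computed per frame, so a pasted object keeps the motion
    # it had in the source clip instead of being frozen in place.
    mask = np.isin(src_labels, class_ids)  # (T, H, W), boolean
    mixed_frames = np.where(mask[..., None], src_frames, tgt_frames)
    mixed_labels = np.where(mask, src_labels, tgt_labels)
    return mixed_frames, mixed_labels


def temporal_consensus_mask(pred_t, pred_prev_aligned):
    """Mark pixels whose predicted class agrees across two frames.

    Both inputs are (H, W) class maps; pred_prev_aligned is assumed to be
    already aligned to the current frame.
    """
    return pred_t == pred_prev_aligned
```

Because the same class selection is applied to every frame of the clip, the pasted objects move consistently from frame to frame, which is the property the abstract refers to as maintaining temporal context.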

arXiv information

Authors	Kyusik Cho, Suhyeon Lee, Hongje Seong, Euntai Kim
Published	2022-11-04 08:10:33+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
