Masked Modeling Duo: Learning Representations by Encouraging Both Networks to Model the Input

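Abstract

Masked Autoencoders is a simple yet powerful self-supervised learning method.
However, it learns representations indirectly, by reconstructing masked input patches.
Several methods learn representations directly, by predicting the representations of masked patches.
However, we consider it suboptimal to use all patches when encoding the representations that serve as the training signal.
We propose Masked Modeling Duo (M2D), a new method that learns representations directly while obtaining the training signal from masked patches only.
In M2D, the online network encodes the visible patches and predicts the representations of the masked patches, while the target network, a momentum encoder, encodes the masked patches.
To predict the target representations well, the online network must model the input well, and the target network must also model it well so that its output agrees with the online predictions.
As a result, the learned representations should model the input better.
We validated M2D by learning general-purpose audio representations, and it set new state-of-the-art performance on tasks such as UrbanSound8K, VoxCeleb1, AudioSet20K, GTZAN, and SpeechCommandsV2.
We additionally validate the effectiveness of M2D for images using ImageNet-1K in the appendix.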

Abstract (original)

Masked Autoencoders is a simple yet powerful self-supervised learning method. However, it learns representations indirectly by reconstructing masked input patches. Several methods learn representations directly by predicting representations of masked patches; however, we think using all patches to encode training signal representations is suboptimal. We propose a new method, Masked Modeling Duo (M2D), that learns representations directly while obtaining training signals using only masked patches. In the M2D, the online network encodes visible patches and predicts masked patch representations, and the target network, a momentum encoder, encodes masked patches. To better predict target representations, the online network should model the input well, while the target network should also model it well to agree with online predictions. Then the learned representations should better model the input. We validated the M2D by learning general-purpose audio representations, and M2D set new state-of-the-art performance on tasks such as UrbanSound8K, VoxCeleb1, AudioSet20K, GTZAN, and SpeechCommandsV2. We additionally validate the effectiveness of M2D for images using ImageNet-1K in the appendix.
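
The data flow described in the abstract can be illustrated with a minimal sketch, using only the Python standard library. It shows three pieces: splitting patch indices into visible and masked sets, a momentum (exponential moving average) update of the target parameters, and a loss comparing predicted and target representations of the masked patches. The function names, the mask ratio, the decay value, and the choice of a squared error on L2-normalized vectors are assumptions made for illustration; the abstract does not specify them, and the encoders themselves are omitted.

```python
import math
import random


def split_patches(num_patches, mask_ratio, rng):
    """Randomly partition patch indices into (visible, masked) lists.

    The online network would see only `visible`; the target network
    would see only `masked`. `mask_ratio` is an illustrative parameter.
    """
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = round(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


def ema_update(target_params, online_params, decay):
    """Momentum update: target <- decay * target + (1 - decay) * online."""
    return [decay * t + (1.0 - decay) * o
            for t, o in zip(target_params, online_params)]


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def prediction_loss(predicted, target):
    """Mean squared distance between L2-normalized vectors.

    `predicted` and `target` are lists of per-patch representation
    vectors for the masked patches. This loss form is an assumption.
    """
    total = 0.0
    for p, t in zip(predicted, target):
        p_n, t_n = l2_normalize(p), l2_normalize(t)
        total += sum((a - b) ** 2 for a, b in zip(p_n, t_n))
    return total / len(predicted)
```

In a training step under these assumptions, `split_patches` would be called once per input, the online network's predictions and the target network's outputs for the masked indices would be passed to `prediction_loss`, and `ema_update` would be applied to the target parameters after each optimizer step on the online network.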

arXiv information

Authors	Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino
Published	2022-11-18 07:20:15+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
