Visually grounded few-shot word learning in low-resource settings


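Our approach uses the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images.
Experiments on Yoruba show the benefit of transferring knowledge from a multimodal model trained on a larger set of English speech-image data.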


We propose a visually grounded speech model that learns new words and their visual depictions from just a few word-image example pairs. Given a set of test images and a spoken query, we ask the model which image depicts the query word. Previous work has simplified this few-shot learning problem by either using an artificial setting with digit word-image pairs or by using a large number of examples per class. Moreover, all previous studies were performed using English speech-image data. We propose an approach that can work on natural word-image pairs but with fewer examples, i.e. fewer shots, and then illustrate how this approach can be applied for multimodal few-shot learning in a real low-resource language, Yoruba. Our approach involves using the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images. Additionally, we use a word-to-image attention mechanism to determine word-image similarity. With this new model, we achieve better performance with fewer shots than previous approaches on an existing English benchmark. Many of the model’s mistakes are due to confusion between visual concepts co-occurring in similar contexts. The experiments on Yoruba show the benefit of transferring knowledge from a multimodal model trained on a larger set of English speech-image data.
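
To make the task setup concrete, the following is a minimal sketch of the two ideas the abstract mentions: scoring a spoken query against image regions with an attention-style similarity, and mining extra training items from an unlabelled pool using the few given examples. It is not the authors' model. The function names (`attention_similarity`, `pick_image`, `mine_pairs`), the softmax-weighted scoring rule, the cosine-based mining criterion, and the toy two-dimensional embeddings are all assumptions made for illustration; a real system would obtain the vectors from trained speech and vision encoders.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def attention_similarity(word_vec, region_vecs):
    """Softmax-weighted sum of word-region scores (assumed scoring rule)."""
    scores = [dot(word_vec, r) for r in region_vecs]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return sum(w / total * s for w, s in zip(weights, scores))


def pick_image(word_vec, images):
    """Index of the test image whose regions best match the spoken query."""
    sims = [attention_similarity(word_vec, regions) for regions in images]
    return max(range(len(images)), key=sims.__getitem__)


def mine_pairs(support_vecs, unlabelled_vecs, top_n):
    """Indices of the top_n unlabelled items closest to any support example."""
    best = [max(cosine(s, u) for s in support_vecs) for u in unlabelled_vecs]
    ranked = sorted(range(len(unlabelled_vecs)), key=lambda i: best[i], reverse=True)
    return ranked[:top_n]


# Toy data: one query embedding and two test images with two regions each.
query = [1.0, 0.0]
images = [
    [[0.1, 0.9], [0.2, 0.8]],
    [[0.9, 0.1], [0.0, 1.0]],
]
chosen = pick_image(query, images)

# Toy mining step: one support example, three unlabelled candidates.
mined = mine_pairs([[1.0, 0.0]], [[0.0, 1.0], [0.9, 0.1], [0.7, 0.7]], top_n=2)
```

In this toy run the second image is chosen because one of its regions aligns closely with the query, and the two unlabelled candidates nearest the support example are kept as mined items. How many items to mine per class and which similarity measure to use are design choices this sketch leaves open.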


Authors: Leanne Nortje, Dan Oneata, Herman Kamper
Published: 2023-06-21 07:22:08+00:00


Categories: cs.CL, eess.AS