Visually grounded few-shot word learning in low-resource settings

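Abstract

We propose a visually grounded speech model that learns new words and their visual depictions from only a few word-image example pairs.
Given a set of test images and a spoken query, we ask the model which image depicts the query word.
Previous work simplified this few-shot learning problem either by using an artificial setting with digit word-image pairs or by using a large number of examples per class.
Moreover, all previous studies were carried out on English speech-image data.
We propose an approach that works on natural word-image pairs with fewer examples, i.e. fewer shots, and show how it can be applied to multimodal few-shot learning in a real low-resource language, Yorùbá.
Our approach uses the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images.
In addition, we use a word-to-image attention mechanism to determine word-image similarity.
With this new model, we achieve better performance with fewer shots than previous approaches on an existing English benchmark.
Many of the model's mistakes are due to confusion between visual concepts that co-occur in similar contexts.
The Yorùbá experiments show the benefit of transferring knowledge from a multimodal model trained on a larger set of English speech-image data.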

Abstract (original)

We propose a visually grounded speech model that learns new words and their visual depictions from just a few word-image example pairs. Given a set of test images and a spoken query, we ask the model which image depicts the query word. Previous work has simplified this few-shot learning problem by either using an artificial setting with digit word-image pairs or by using a large number of examples per class. Moreover, all previous studies were performed using English speech-image data. We propose an approach that can work on natural word-image pairs but with fewer examples, i.e. fewer shots, and then illustrate how this approach can be applied for multimodal few-shot learning in a real low-resource language, Yorùbá. Our approach involves using the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images. Additionally, we use a word-to-image attention mechanism to determine word-image similarity. With this new model, we achieve better performance with fewer shots than previous approaches on an existing English benchmark. Many of the model's mistakes are due to confusion between visual concepts co-occurring in similar contexts. The experiments on Yorùbá show the benefit of transferring knowledge from a multimodal model trained on a larger set of English speech-image data.
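The abstract does not spell out how the word-to-image attention score is computed, so the following is only a minimal sketch of one generic way such a score could work, not the authors' model. It assumes the spoken query has already been embedded as a single vector and each image as a list of patch vectors; the names `attention_score` and `pick_image`, the dot-product similarity, and the toy embeddings are all hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_score(word_vec, patch_vecs):
    """Assumed scoring rule: attend over image patches with the word
    embedding as the query, then return the attention-weighted
    word-patch similarity."""
    sims = [dot(word_vec, p) for p in patch_vecs]
    weights = softmax(sims)
    return sum(w * s for w, s in zip(weights, sims))


def pick_image(word_vec, images):
    """Return the index of the test image that scores highest
    for the spoken query."""
    scores = [attention_score(word_vec, patches) for patches in images]
    return max(range(len(images)), key=scores.__getitem__)


# Toy data (hypothetical): a 2-d query embedding and two images,
# each represented by two patch embeddings.
query = [1.0, 0.0]
images = [
    [[0.1, 0.9], [0.2, 0.8]],  # no patch resembles the query
    [[0.9, 0.1], [0.0, 1.0]],  # first patch resembles the query
]
best = pick_image(query, images)
```

In this toy setup the second image wins because one of its patches aligns closely with the query vector, which mirrors the test task described above: given a spoken query and a set of images, select the image that depicts the query word.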

arXiv information

Authors	Leanne Nortje, Dan Oneata, Herman Kamper
Published	2024-04-18 17:36:53+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
