Towards Zero-Shot Multimodal Machine Translation

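Abstract

Current multimodal machine translation (MMT) systems depend on fully supervised data: models are trained on sentences paired with their translations and accompanying images.
However, this kind of data is expensive to collect, which limits the extension of MMT to language pairs for which no such data exists.
In this work, we propose a method that bypasses the need for fully supervised data when training MMT systems, using only multimodal English data.
The method, called ZeroMMT, adapts a strong text-only machine translation (MT) model by training it on a mixture of two objectives: visually conditioned masked language modelling and the Kullback-Leibler divergence between the original and new MMT outputs.
We evaluate on standard MMT benchmarks and on the recently released CoMMuTE, a contrastive benchmark that evaluates how well models use images to disambiguate English sentences.
We obtain disambiguation performance close to that of state-of-the-art MMT models that were additionally trained on fully supervised examples.
To show that our method generalizes to languages for which no fully supervised training data is available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese.
We further show that classifier-free guidance lets us control the trade-off between disambiguation capability and translation fidelity at inference time, without any additional data.
Our code, data and trained models are publicly available.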

Abstract (original)

Current multimodal machine translation (MMT) systems rely on fully supervised data (i.e., models are trained on sentences with their translations and accompanying images). However, this type of data is costly to collect, limiting the extension of MMT to other language pairs for which such data does not exist. In this work, we propose a method to bypass the need for fully supervised data to train MMT systems, using multimodal English data only. Our method, called ZeroMMT, consists in adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives: visually conditioned masked language modelling and the Kullback-Leibler divergence between the original and new MMT outputs. We evaluate on standard MMT benchmarks and the recently released CoMMuTE, a contrastive benchmark aiming to evaluate how well models use images to disambiguate English sentences. We obtain disambiguation performance close to state-of-the-art MMT models trained additionally on fully supervised examples. To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese. We further show that we can control the trade-off between disambiguation capabilities and translation fidelity at inference time using classifier-free guidance and without any additional data. Our code, data and trained models are publicly accessible.
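
The two ideas named in the abstract, the mixed training objective and classifier-free guidance at inference time, can be illustrated with a toy sketch over single next-token distributions. This is not the authors' implementation: the function names, the `kl_weight` parameter, the direction of the KL term, and the exact guidance formula are assumptions made for illustration, since the abstract does not specify them.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target token (masked-LM style loss)."""
    return -log_softmax(logits)[target_index]


def kl_divergence(p_logits, q_logits):
    """KL(p || q) between two categorical distributions given as logits."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mixed_loss(vmlm_logits, masked_token, original_logits, adapted_logits,
               kl_weight=1.0):
    """Sum of a masked-token loss and a KL penalty.

    Assumptions: `kl_weight` is a hypothetical hyperparameter, and the KL
    direction (original model as reference) is chosen for illustration.
    """
    vmlm = cross_entropy(vmlm_logits, masked_token)
    kl = kl_divergence(original_logits, adapted_logits)
    return vmlm + kl_weight * kl


def cfg_logits(text_only_logits, with_image_logits, guidance_scale):
    """Generic classifier-free guidance: extrapolate from the text-only
    prediction towards the image-conditioned one (assumed formulation)."""
    return [t + guidance_scale * (v - t)
            for t, v in zip(text_only_logits, with_image_logits)]
```

In this sketch a `guidance_scale` of 0 returns the text-only prediction and 1 returns the image-conditioned prediction; larger values push further towards what the image changes, which is one way to picture the trade-off between disambiguation and translation fidelity.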

arXiv information

Authors	Matthieu Futeral, Cordelia Schmid, Benoît Sagot, Rachel Bawden
Published	2025-03-11 13:07:09+00:00

