ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT

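Summary

Neural machine translation (NMT) has improved translation quality through Transformer-based models, but it still struggles with word ambiguity and context.
This problem is especially important in domain-specific applications, which often suffer from unclear sentences or poor data quality.
Our research explores how adding information to models can improve translation of e-commerce data.
To this end, we create ConECT, a new Czech-to-Polish e-commerce product translation dataset of 11,400 sentence pairs, coupled with images and product metadata.
We then investigate and compare different methods applicable to context-aware translation.
We test a vision-language model (VLM) and find that visual context aids translation quality.
We also explore incorporating contextual information, such as the product's category path or image descriptions, into text-to-text models.
The results show that incorporating contextual information improves machine translation quality.
We make the new dataset publicly available.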

Abstract (original)

Neural Machine Translation (NMT) has improved translation by using Transformer-based models, but it still struggles with word ambiguity and context. This problem is especially important in domain-specific applications, which often have problems with unclear sentences or poor data quality. Our research explores how adding information to models can improve translations in the context of e-commerce data. To this end we create ConECT — a new Czech-to-Polish e-commerce product translation dataset coupled with images and product metadata consisting of 11,400 sentence pairs. We then investigate and compare different methods that are applicable to context-aware translation. We test a vision-language model (VLM), finding that visual context aids translation quality. Additionally, we explore the incorporation of contextual information into text-to-text models, such as the product’s category path or image descriptions. The results of our study demonstrate that the incorporation of contextual information leads to an improvement in the quality of machine translation. We make the new dataset publicly available.
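
The abstract mentions feeding contextual information, such as the product's category path or image descriptions, into text-to-text models. As a minimal sketch of what that could look like, the snippet below prefixes a source segment with whatever context is available before it is passed to a translation model. The class and function names, the `category:` / `image:` / `text:` labels, the separator, and the example strings are all hypothetical and chosen for illustration; the abstract does not specify the input format the authors actually used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProductSegment:
    """One source segment plus optional context (field names are hypothetical)."""
    source_text: str
    category_path: Optional[str] = None
    image_description: Optional[str] = None


def build_model_input(segment: ProductSegment, sep: str = " | ") -> str:
    """Prefix the source text with whatever context is available.

    The labels and the separator are illustrative assumptions,
    not the format used in the paper.
    """
    parts = []
    if segment.category_path:
        parts.append(f"category: {segment.category_path}")
    if segment.image_description:
        parts.append(f"image: {segment.image_description}")
    # The source text always comes last so the model sees context first.
    parts.append(f"text: {segment.source_text}")
    return sep.join(parts)


# Made-up example, not taken from the ConECT dataset.
example = ProductSegment(
    source_text="Pánská bunda",
    category_path="Móda > Muži > Bundy",
)
print(build_model_input(example))
```

Keeping the context in a plain-text prefix means the same segment can be tried with and without each context field, which is one simple way to compare context-aware and context-free inputs.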

arXiv information

Authors	Mikołaj Pokrywka, Wojciech Kusa, Mieszko Rutkowski, Mikołaj Koszowski
Published	2025-06-09 14:39:22+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
