The Devil Is in the Word Alignment Details: On Translation-Based Cross-Lingual Transfer for Token Classification Tasks

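Summary

Translation-based strategies for cross-lingual transfer (XLT) are competitive XLT baselines. Two common ones are translate-train (training on noisy target-language data translated from the source language) and translate-test (evaluating on noisy source-language data translated from the target language).
For token classification tasks, however, these strategies require label projection: the challenging step of mapping the label of each token in the original sentence to its counterpart(s) in the translation.
Word aligners (WAs) are commonly used for label projection, but the low-level design decisions involved in applying them to translation-based XLT have not been systematically investigated.
Moreover, recent marker-based methods, which project labeled spans by inserting tags around them before (or after) translation, claim to outperform WAs in label projection for XLT.
In this work, we revisit WAs for label projection and systematically investigate how low-level design decisions affect token-level XLT: (i) the algorithm for projecting labels between (multi-)token spans, (ii) filtering strategies that reduce the number of noisily mapped labels, and (iii) the pre-tokenization of the translated sentences.
We find that all of these substantially affect translation-based XLT performance and show that, with optimized choices, XLT with WAs performs at least comparably to marker-based methods.
We then introduce a new projection strategy that ensembles translate-train and translate-test predictions and demonstrate that it substantially outperforms marker-based projection.
Crucially, the proposed ensembling also reduces sensitivity to low-level WA design choices, making XLT for token classification tasks more robust.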
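To make the label projection step concrete, the following is a minimal sketch of projecting BIO labels through word alignments. It is an illustration under stated assumptions, not the authors' algorithm: the function name `project_labels` is hypothetical, the alignment pairs are assumed to be already produced by some word aligner, and the span-building rule (adjacent target tokens of the same type form one span) is a deliberate simplification.

```python
from typing import List, Optional, Tuple


def project_labels(
    src_labels: List[str],
    alignments: List[Tuple[int, int]],
    tgt_len: int,
) -> List[str]:
    """Project BIO labels from source tokens onto target tokens.

    `alignments` holds (source_index, target_index) pairs, assumed to come
    from a word aligner. This is a hypothetical, simplified projection rule.
    """
    # Entity type assigned to each target token (None means outside any span).
    tgt_types: List[Optional[str]] = [None] * tgt_len
    for s, t in sorted(alignments):
        label = src_labels[s]
        if label == "O":
            continue
        ent_type = label.split("-", 1)[1]
        # Keep the first type seen for a target token; later conflicts are ignored.
        if tgt_types[t] is None:
            tgt_types[t] = ent_type

    # Rebuild BIO tags: adjacent target tokens of the same type form one span.
    projected: List[str] = []
    prev: Optional[str] = None
    for ent_type in tgt_types:
        if ent_type is None:
            projected.append("O")
        elif ent_type == prev:
            projected.append("I-" + ent_type)
        else:
            projected.append("B-" + ent_type)
        prev = ent_type
    return projected


# English "Merkel visited Paris" -> German "Merkel hat Paris besucht"
src_labels = ["B-PER", "O", "B-LOC"]
alignments = [(0, 0), (1, 1), (1, 3), (2, 2)]
projected = project_labels(src_labels, alignments, tgt_len=4)
print(projected)
```

Even this small sketch exposes the kind of low-level choices the summary refers to: how to resolve conflicting alignments, whether to merge or split adjacent spans, and which alignment pairs to discard as noise.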

Abstract (original)

Translation-based strategies for cross-lingual transfer (XLT), such as translate-train (training on noisy target language data translated from the source language) and translate-test (evaluating on noisy source language data translated from the target language), are competitive XLT baselines. In XLT for token classification tasks, however, these strategies include label projection, the challenging step of mapping the labels from each token in the original sentence to its counterpart(s) in the translation. Although word aligners (WAs) are commonly used for label projection, the low-level design decisions for applying them to translation-based XLT have not been systematically investigated. Moreover, recent marker-based methods, which project labeled spans by inserting tags around them before (or after) translation, claim to outperform WAs in label projection for XLT. In this work, we revisit WAs for label projection, systematically investigating the effects of low-level design decisions on token-level XLT: (i) the algorithm for projecting labels between (multi-)token spans, (ii) filtering strategies to reduce the number of noisily mapped labels, and (iii) the pre-tokenization of the translated sentences. We find that all of these substantially impact translation-based XLT performance and show that, with optimized choices, XLT with WA offers performance at least comparable to that of marker-based methods. We then introduce a new projection strategy that ensembles translate-train and translate-test predictions and demonstrate that it substantially outperforms the marker-based projection. Crucially, we show that our proposed ensembling also reduces sensitivity to low-level WA design choices, resulting in more robust XLT for token classification tasks.
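The abstract does not specify how the translate-train and translate-test predictions are combined, so the following is only one plausible sketch of such an ensemble: a weighted average of per-token class probabilities. The function name `ensemble_predictions`, the `weight` parameter, and the assumption that translate-test predictions have already been projected back onto the target-language tokens are all illustrative and not taken from the paper.

```python
from typing import List


def ensemble_predictions(
    probs_train: List[List[float]],
    probs_test: List[List[float]],
    labels: List[str],
    weight: float = 0.5,
) -> List[str]:
    """Average per-token class probabilities from two models and pick the argmax.

    probs_train: probabilities from a translate-train model on target tokens.
    probs_test:  probabilities from a translate-test model, assumed to be
                 already projected back onto the same target tokens.
    weight:      hypothetical mixing weight given to the translate-train model.
    """
    if len(probs_train) != len(probs_test):
        raise ValueError("both inputs must cover the same tokens")

    predicted: List[str] = []
    for p_train, p_test in zip(probs_train, probs_test):
        # Convex combination of the two distributions for this token.
        mixed = [weight * a + (1.0 - weight) * b for a, b in zip(p_train, p_test)]
        best = max(range(len(mixed)), key=mixed.__getitem__)
        predicted.append(labels[best])
    return predicted


labels = ["O", "B-LOC"]
probs_train = [[0.9, 0.1], [0.6, 0.4]]  # translate-train is unsure about token 2
probs_test = [[0.8, 0.2], [0.1, 0.9]]   # translate-test is confident about token 2
ensembled = ensemble_predictions(probs_train, probs_test, labels)
print(ensembled)
```

In this toy input the translate-train model alone would label the second token `O`, while the averaged distribution favors `B-LOC`. That shows how an ensemble can offset errors made by one of the two pipelines.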

arXiv information

Authors	Benedikt Ebing, Goran Glavaš
Published	2025-05-15 17:10:50+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
