Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues

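Abstract

Our goal is to translate continuous sign language into spoken-language text.
Inspired by how human interpreters rely on context to translate accurately, we incorporate additional contextual cues, together with the signing video, into a new translation framework.
Specifically, in addition to visual sign recognition features that encode the input video, we integrate complementary textual information from (i) captions describing the background show, (ii) translations of the previous sentences, and (iii) pseudo-glosses that transcribe the signing.
These cues are extracted automatically and fed, along with the visual features, into a pre-trained large language model (LLM), which is fine-tuned to generate spoken-language translations in text form.
Through extensive ablation studies, we show that each input cue contributes positively to translation performance.
We train and evaluate our approach on BOBSL, the largest British Sign Language dataset currently available.
Our contextual approach significantly improves translation quality compared with previously reported results on BOBSL and with state-of-the-art methods that we implement as baselines.
Furthermore, we demonstrate the generality of the approach by also applying it to How2Sign, an American Sign Language dataset, where it achieves competitive results.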
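
The way the three textual cues might be combined into a single text input can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `ContextCues` container, the `build_text_prompt` helper, the `max_previous` parameter, and the prompt template are all hypothetical, and the visual features that the framework also feeds to the LLM are left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContextCues:
    """Hypothetical container for the three textual cues named in the abstract."""
    background_caption: str                                          # (i) caption describing the background show
    previous_translations: List[str] = field(default_factory=list)   # (ii) translations of earlier sentences
    pseudo_glosses: List[str] = field(default_factory=list)          # (iii) pseudo-glosses for the signing


def build_text_prompt(cues: ContextCues, max_previous: int = 2) -> str:
    """Join the textual cues into one prompt string.

    The template wording is an assumption made for illustration only;
    the source does not specify the actual input format.
    """
    # Keep only the most recent sentences; guard against [-0:] returning everything.
    previous = cues.previous_translations[-max_previous:] if max_previous > 0 else []
    lines = [
        "Background: " + (cues.background_caption or "(none)"),
        "Previous sentences: " + (" ".join(previous) or "(none)"),
        "Pseudo-glosses: " + (", ".join(cues.pseudo_glosses) or "(none)"),
        "Translation:",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    example = ContextCues(
        background_caption="A chef prepares pasta in a kitchen.",
        previous_translations=["First.", "Second.", "Third."],
        pseudo_glosses=["COOK", "PASTA"],
    )
    print(build_text_prompt(example))
```

In a full system, a string like this would be tokenized and passed to the LLM together with the visual sign recognition features; how those two inputs are fused is not covered by this sketch.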

Abstract (original)

Our objective is to translate continuous sign language into spoken language text. Inspired by the way human interpreters rely on context for accurate translation, we incorporate additional contextual cues together with the signing video, into a new translation framework. Specifically, besides visual sign recognition features that encode the input video, we integrate complementary textual information from (i) captions describing the background show, (ii) translation of previous sentences, as well as (iii) pseudo-glosses transcribing the signing. These are automatically extracted and inputted along with the visual features to a pre-trained large language model (LLM), which we fine-tune to generate spoken language translations in text form. Through extensive ablation studies, we show the positive contribution of each input cue to the translation performance. We train and evaluate our approach on BOBSL — the largest British Sign Language dataset currently available. We show that our contextual approach significantly enhances the quality of the translations compared to previously reported results on BOBSL, and also to state-of-the-art methods that we implement as baselines. Furthermore, we demonstrate the generality of our approach by applying it also to How2Sign, an American Sign Language dataset, and achieve competitive results.

arXiv information

Authors	Youngjoon Jang, Haran Raajesh, Liliane Momeni, Gül Varol, Andrew Zisserman
Published	2025-01-16 18:59:03+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
