A Spatio-Temporal Representation Learning as an Alternative to Traditional Glosses in Sign Language Translation and Production

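Summary

This work addresses the challenges of using glosses in both Sign Language Translation (SLT) and Sign Language Production (SLP).
Glosses have long served as a bridge between sign language and spoken language, but they have two major limitations that hinder progress in sign language systems.
First, gloss annotation is labor-intensive and time-consuming, which limits how far datasets can scale.
Second, glosses oversimplify sign language: they strip away its spatio-temporal dynamics, reduce complex signs to basic labels, and miss the subtle movements that precise interpretation depends on.
To address these limitations, we introduce Universal Gloss-level Representation (UniGloR), a framework designed to capture the spatio-temporal features inherent in sign language, offering a more dynamic and detailed alternative to glosses.
The core idea of UniGloR is simple yet effective: derive dense spatio-temporal representations from sign keypoint sequences using self-supervised learning, then integrate them into SLT and SLP tasks.
Experiments in a keypoint-based setting show that UniGloR outperforms or matches previous SLT and SLP methods on two widely used datasets, PHOENIX14T and How2Sign.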

Abstract (original)

This work addresses the challenges associated with the use of glosses in both Sign Language Translation (SLT) and Sign Language Production (SLP). While glosses have long been used as a bridge between sign language and spoken language, they come with two major limitations that impede the advancement of sign language systems. First, annotating the glosses is a labor-intensive and time-consuming process, which limits the scalability of datasets. Second, the glosses oversimplify sign language by stripping away its spatio-temporal dynamics, reducing complex signs to basic labels and missing the subtle movements essential for precise interpretation. To address these limitations, we introduce Universal Gloss-level Representation (UniGloR), a framework designed to capture the spatio-temporal features inherent in sign language, providing a more dynamic and detailed alternative to the use of the glosses. The core idea of UniGloR is simple yet effective: We derive dense spatio-temporal representations from sign keypoint sequences using self-supervised learning and seamlessly integrate them into SLT and SLP tasks. Our experiments in a keypoint-based setting demonstrate that UniGloR either outperforms or matches the performance of previous SLT and SLP methods on two widely-used datasets: PHOENIX14T and How2Sign.

arXiv information

Authors	Eui Jun Hwang, Sukmin Cho, Huije Lee, Youngwoo Yoon, Jong C. Park
Published	2024-12-04 13:41:11+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
