3D Prior is All You Need: Cross-Task Few-shot 2D Gaze Estimation

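Summary

3D and 2D gaze estimation share the fundamental goal of capturing eye movements, yet they have traditionally been treated as two separate research domains.
This paper introduces a new cross-task few-shot 2D gaze estimation approach that adapts a pre-trained 3D gaze estimation network to 2D gaze prediction on unseen devices using only a few training images.
The task is highly challenging because of the domain gap between 3D and 2D gaze, unknown screen poses, and limited training data.
To address these challenges, the authors propose a new framework that bridges the gap between 3D and 2D gaze.
The framework contains a physics-based differentiable projection module with learnable parameters that models the screen pose and projects 3D gaze into 2D gaze.
The framework is fully differentiable and can be integrated into existing 3D gaze networks without modifying their original architecture.
In addition, the authors introduce a dynamic pseudo-labelling strategy for flipped images, which is particularly difficult for 2D labels because the screen pose is unknown.
To overcome this, they reverse the projection process by converting 2D labels into 3D space, where the flipping is performed.
Notably, this 3D space is not aligned with the camera coordinate system, so a dynamic transformation matrix is learned to compensate for the misalignment.
The method is evaluated on the MPIIGaze, EVE, and GazeCapture datasets, collected on laptops, desktop computers, and mobile devices respectively.
The superior performance highlights the effectiveness of the approach and shows its strong potential for real-world applications.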

Summary (original)

3D and 2D gaze estimation share the fundamental objective of capturing eye movements but are traditionally treated as two distinct research domains. In this paper, we introduce a novel cross-task few-shot 2D gaze estimation approach, aiming to adapt a pre-trained 3D gaze estimation network for 2D gaze prediction on unseen devices using only a few training images. This task is highly challenging due to the domain gap between 3D and 2D gaze, unknown screen poses, and limited training data. To address these challenges, we propose a novel framework that bridges the gap between 3D and 2D gaze. Our framework contains a physics-based differentiable projection module with learnable parameters to model screen poses and project 3D gaze into 2D gaze. The framework is fully differentiable and can integrate into existing 3D gaze networks without modifying their original architecture. Additionally, we introduce a dynamic pseudo-labelling strategy for flipped images, which is particularly challenging for 2D labels due to unknown screen poses. To overcome this, we reverse the projection process by converting 2D labels to 3D space, where flipping is performed. Notably, this 3D space is not aligned with the camera coordinate system, so we learn a dynamic transformation matrix to compensate for this misalignment. We evaluate our method on MPIIGaze, EVE, and GazeCapture datasets, collected respectively on laptops, desktop computers, and mobile devices. The superior performance highlights the effectiveness of our approach, and demonstrates its strong potential for real-world applications.
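The projection step described in the abstract can be illustrated with a ray-plane intersection. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the screen pose convention (`p_cam = R @ p_screen + t`, with the screen as the plane z = 0 in its own coordinates), and the millimetre units are all hypothetical. It uses plain Python with fixed pose values; in the paper the pose parameters are learnable and the module is differentiable, which would call for an autodiff framework. The learned transformation matrix that compensates for coordinate misalignment is omitted here.

```python
import math


def mat_vec(m, v):
    """Multiply a 3x3 matrix (tuple of rows) by a 3-vector."""
    return tuple(sum(m[i][k] * v[k] for k in range(3)) for i in range(3))


def transpose(m):
    """Transpose a 3x3 matrix given as a tuple of rows."""
    return tuple(tuple(m[j][i] for j in range(3)) for i in range(3))


def project_gaze_to_screen(origin, direction, rotation, translation):
    """Intersect a gaze ray with the screen plane and return (x, y).

    origin, direction: gaze ray in camera coordinates.
    rotation, translation: assumed screen pose, mapping screen
    coordinates to camera coordinates (p_cam = R @ p_screen + t).
    """
    r_inv = transpose(rotation)  # the inverse of a rotation is its transpose
    o_s = mat_vec(r_inv, tuple(o - t for o, t in zip(origin, translation)))
    d_s = mat_vec(r_inv, direction)
    if abs(d_s[2]) < 1e-12:
        raise ValueError("gaze ray is parallel to the screen plane")
    scale = -o_s[2] / d_s[2]  # ray parameter at which z reaches 0
    return (o_s[0] + scale * d_s[0], o_s[1] + scale * d_s[1])


def screen_point_to_gaze(point_2d, origin, rotation, translation):
    """Reverse step: lift a 2D screen point to a unit 3D gaze direction."""
    p_cam = mat_vec(rotation, (point_2d[0], point_2d[1], 0.0))
    p_cam = tuple(p + t for p, t in zip(p_cam, translation))
    d = tuple(p - o for p, o in zip(p_cam, origin))
    norm = math.sqrt(sum(c * c for c in d))
    return tuple(c / norm for c in d)


def flip_gaze_horizontally(direction):
    """Mirror a 3D gaze direction to match a horizontally flipped image."""
    return (-direction[0], direction[1], direction[2])


# Illustrative values only: screen coincides with the camera plane,
# eye 600 mm in front of it, looking back towards the screen.
IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
NO_SHIFT = (0.0, 0.0, 0.0)
EYE = (0.0, 0.0, 600.0)

point = project_gaze_to_screen(EYE, (0.1, -0.2, -1.0), IDENTITY, NO_SHIFT)
gaze = screen_point_to_gaze(point, EYE, IDENTITY, NO_SHIFT)
flipped_point = project_gaze_to_screen(
    EYE, flip_gaze_horizontally(gaze), IDENTITY, NO_SHIFT
)
```

With these example values, `point` lands near (60, -120) on the screen and `flipped_point` is its mirror image about the vertical axis. That only holds because the assumed pose is axis-aligned; with an unknown, tilted screen the 2D label cannot simply be mirrored, which is why the abstract describes lifting the label to 3D before flipping.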

arXiv information

Authors	Yihua Cheng, Hengfei Wang, Zhongqun Zhang, Yang Yue, Bo Eun Kim, Feng Lu, Hyung Jin Chang
Published	2025-02-06 13:37:09+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
