Sharingan: Extract User Action Sequence from Desktop Recordings

Summary

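Video recordings of user activity, especially desktop recordings, are a rich data source for understanding user behavior and automating processes.
Yet although vision-language models (VLMs) have advanced and are increasingly used for video analysis, extracting user actions from desktop recordings remains underexplored.
This paper addresses that gap by proposing two new VLM-based methods for extracting user actions: the Direct Frame-Based Approach (DF), which feeds sampled frames directly into a VLM, and the Differential Frame-Based Approach (DiffF), which additionally incorporates explicit frame differences detected with computer vision techniques.
The methods are evaluated on a basic self-curated dataset and an advanced benchmark adapted from prior work.
The results show that DF identifies user actions with 70% to 80% accuracy, and that the extracted action sequences can be replayed through Robotic Process Automation.
VLMs show potential, but incorporating explicit UI changes can degrade performance, which makes DF the more reliable approach.
This work is the first application of VLMs to extracting user action sequences from desktop recordings, and it contributes new methods, benchmarks, and insights for future research.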

Abstract (original)

Video recordings of user activities, particularly desktop recordings, offer a rich source of data for understanding user behaviors and automating processes. However, despite advancements in Vision-Language Models (VLMs) and their increasing use in video analysis, extracting user actions from desktop recordings remains an underexplored area. This paper addresses this gap by proposing two novel VLM-based methods for user action extraction: the Direct Frame-Based Approach (DF), which inputs sampled frames directly into VLMs, and the Differential Frame-Based Approach (DiffF), which incorporates explicit frame differences detected via computer vision techniques. We evaluate these methods using a basic self-curated dataset and an advanced benchmark adapted from prior work. Our results show that the DF approach achieves an accuracy of 70% to 80% in identifying user actions, with the extracted action sequences being re-playable through Robotic Process Automation. We find that while VLMs show potential, incorporating explicit UI changes can degrade performance, making the DF approach more reliable. This work represents the first application of VLMs for extracting user action sequences from desktop recordings, contributing new methods, benchmarks, and insights for future research.
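The abstract does not say how frames are sampled or which computer vision technique detects frame differences, so the following is only a minimal sketch under stated assumptions: uniform stride sampling stands in for the frame sampling step shared by both approaches, and a per-pixel intensity threshold stands in for the difference detection used by DiffF. The names `sample_frames`, `changed_region`, `stride`, and `threshold` are hypothetical and do not come from the paper.

```python
from typing import List, Optional, Tuple

# A grayscale frame: rows of pixel intensities in 0-255 (assumed representation).
Frame = List[List[int]]


def sample_frames(frames: List[Frame], stride: int) -> List[Frame]:
    """Keep every `stride`-th frame (uniform sampling is an assumed policy)."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return frames[::stride]


def changed_region(
    prev: Frame, curr: Frame, threshold: int = 16
) -> Optional[Tuple[int, int, int, int]]:
    """Return the inclusive bounding box (top, left, bottom, right) of pixels
    whose intensity differs by more than `threshold`, or None if none do."""
    rows = []
    cols = []
    for r, (prev_row, curr_row) in enumerate(zip(prev, curr)):
        for c, (p, q) in enumerate(zip(prev_row, curr_row)):
            if abs(p - q) > threshold:
                rows.append(r)
                cols.append(c)
    if not rows:
        return None
    return (min(rows), min(cols), max(rows), max(cols))


# Usage: a 3x4 blank frame and a copy with two changed pixels.
blank = [[0] * 4 for _ in range(3)]
clicked = [row[:] for row in blank]
clicked[1][2] = 255
clicked[2][3] = 255
region = changed_region(blank, clicked)
```

In a DF-style pipeline only the sampled frames would be passed to the VLM; in a DiffF-style pipeline a region like `region` would be supplied alongside them as the explicit UI change.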

arXiv information

Authors	Yanting Chen, Yi Ren, Xiaoting Qin, Jue Zhang, Kehong Yuan, Lu Han, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
Published	2024-11-13 16:53:29+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
