SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos

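Summary

Video Scene Graph Generation (VidSGG) is an important topic for understanding dynamic kitchen environments.
Current VidSGG models require extensive training to produce scene graphs.
Recently, Vision Language Models (VLMs) and Vision Foundation Models (VFMs) have demonstrated impressive zero-shot capabilities across a variety of tasks.
However, VLMs such as Gemini struggle with the dynamics of VidSGG and fail to maintain stable object identities across frames.
To overcome this limitation, we propose SAMJAM, a zero-shot pipeline that combines SAM2's temporal tracking with Gemini's semantic understanding.
SAM2 also improves Gemini's object grounding by producing more accurate bounding boxes.
In this method, we first prompt Gemini to generate a frame-level scene graph.
Next, a matching algorithm maps each object in the scene graph to a SAM2-generated or SAM2-propagated mask, producing a temporally consistent scene graph in dynamic environments.
Finally, this process is repeated for each subsequent frame.
We empirically demonstrate that SAMJAM outperforms Gemini by 8.33% in mean recall on the EPIC-KITCHENS and EPIC-KITCHENS-100 datasets.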
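The matching step in the pipeline above can be illustrated with a small sketch. The abstract does not say which matching criterion SAMJAM uses, so everything here is an assumption made for illustration: objects and tracked masks are reduced to bounding boxes, similarity is intersection over union (IoU), assignment is greedy, and the names `iou`, `match_objects_to_tracks`, and `min_iou` are hypothetical. It involves no Gemini or SAM2 calls.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_objects_to_tracks(
    graph_boxes: Dict[str, Box],
    track_boxes: Dict[int, Box],
    min_iou: float = 0.5,  # assumed threshold, not taken from the paper
) -> Dict[str, int]:
    """Greedily assign each scene-graph object to at most one tracked mask."""
    # Score every (object, track) pair and visit the best overlaps first.
    pairs = sorted(
        (
            (iou(gb, tb), name, tid)
            for name, gb in graph_boxes.items()
            for tid, tb in track_boxes.items()
        ),
        reverse=True,
    )
    assigned: Dict[str, int] = {}
    used_tracks = set()
    for score, name, tid in pairs:
        if score < min_iou:
            break  # all remaining pairs overlap even less
        if name in assigned or tid in used_tracks:
            continue
        assigned[name] = tid
        used_tracks.add(tid)
    return assigned


# Hypothetical frame: boxes from a frame-level scene graph and from tracked masks.
graph_boxes = {
    "knife": (10, 10, 50, 30),
    "onion": (60, 40, 100, 80),
    "pan": (300, 300, 320, 320),
}
track_boxes = {
    1: (12, 11, 52, 31),
    2: (58, 42, 98, 82),
    3: (200, 200, 240, 240),
}
matches = match_objects_to_tracks(graph_boxes, track_boxes)
```

In this toy frame, "knife" and "onion" each attach to an existing track, while "pan" overlaps no track and stays unmatched. In a pipeline like the one described, such an object would presumably need a newly generated mask.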

Abstract (original)

Video Scene Graph Generation (VidSGG) is an important topic in understanding dynamic kitchen environments. Current models for VidSGG require extensive training to produce scene graphs. Recently, Vision Language Models (VLM) and Vision Foundation Models (VFM) have demonstrated impressive zero-shot capabilities in a variety of tasks. However, VLMs like Gemini struggle with the dynamics for VidSGG, failing to maintain stable object identities across frames. To overcome this limitation, we propose SAMJAM, a zero-shot pipeline that combines SAM2’s temporal tracking with Gemini’s semantic understanding. SAM2 also improves upon Gemini’s object grounding by producing more accurate bounding boxes. In our method, we first prompt Gemini to generate a frame-level scene graph. Then, we employ a matching algorithm to map each object in the scene graph with a SAM2-generated or SAM2-propagated mask, producing a temporally-consistent scene graph in dynamic environments. Finally, we repeat this process again in each of the following frames. We empirically demonstrate that SAMJAM outperforms Gemini by 8.33% in mean recall on the EPIC-KITCHENS and EPIC-KITCHENS-100 datasets.
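For readers unfamiliar with the metric, mean recall in scene graph work is commonly computed by taking recall separately for each predicate class and then averaging over classes, so that frequent predicates do not dominate. The sketch below follows that common definition. Whether the paper uses exactly this protocol, and whether the 8.33% figure is an absolute or relative gain, is not stated in the abstract. The function name and the example triplets are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Tuple

Triplet = Tuple[str, str, str]  # (subject, predicate, object)


def mean_recall(ground_truth: Iterable[Triplet], predicted: Iterable[Triplet]) -> float:
    """Average of per-predicate recall over the predicates present in the ground truth."""
    predicted_set = set(predicted)
    hits = defaultdict(int)
    totals = defaultdict(int)
    for triplet in set(ground_truth):
        predicate = triplet[1]
        totals[predicate] += 1
        if triplet in predicted_set:
            hits[predicate] += 1
    if not totals:
        return 0.0
    return sum(hits[p] / totals[p] for p in totals) / len(totals)


# Hypothetical annotations for a single frame.
gt = [
    ("hand", "holding", "knife"),
    ("hand", "holding", "onion"),
    ("knife", "cutting", "onion"),
]
pred = [
    ("hand", "holding", "knife"),
    ("knife", "cutting", "onion"),
]
score = mean_recall(gt, pred)
```

Here "holding" has recall 1/2 and "cutting" has recall 1/1, so the mean recall is 0.75, whereas plain recall over all triplets would be 2/3.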

arXiv information

Authors	Joshua Li, Fernando Jose Pena Cantu, Emily Yu, Alexander Wong, Yuchen Cui, Yuhao Chen
Published	2025-04-10 15:43:10+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
