Semantic2Graph: Graph-based Multi-modal Feature Fusion for Action Segmentation in Videos

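Summary

Video action segmentation and recognition are widely applied across many fields.
Most previous work relies on large, computationally expensive visual models to understand videos comprehensively.
Few studies, however, use a graph model directly to reason about video.
Graph models offer fewer parameters, low computational cost, a large receptive field, and flexible aggregation of neighborhood messages.
This paper presents Semantic2Graph, a graph-based method that recasts video action segmentation and recognition as node classification on a graph.
To preserve fine-grained relations in a video, the graph is built at the frame level with three types of edges: temporal, semantic, and self-loop.
Visual, structural, and semantic features are combined as node attributes.
Semantic edges model long-term spatio-temporal relations, while semantic features are embeddings of the label text, built from textual prompts.
A graph neural network (GNN) model learns the multi-modal feature fusion.
Experiments show that Semantic2Graph improves on state-of-the-art results on GTEA and 50Salads.
Multiple ablation experiments further confirm that semantic features improve model performance, and that semantic edges let Semantic2Graph capture long-term dependencies at low cost.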
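
The frame-level graph with three edge types can be illustrated with a minimal sketch. The summary does not say how semantic edges are chosen, so the rule used here (link a frame to later, non-adjacent frames that carry the same action label) is an assumption made only for illustration. The function name `build_frame_graph` and the parameter `max_semantic_per_node` are likewise hypothetical and do not come from the paper.

```python
from collections import defaultdict


def build_frame_graph(labels, max_semantic_per_node=2):
    """Return (src, dst, edge_type) tuples for a frame-level video graph.

    Each frame is one node. Three edge types are produced:
      - "self_loop": every node points to itself
      - "temporal":  frame i points to frame i + 1
      - "semantic":  ASSUMED rule, not taken from the paper: frame i points
                     to later, non-adjacent frames with the same label
    """
    n = len(labels)
    edges = []

    # Self-loops let a node keep its own features during aggregation.
    for i in range(n):
        edges.append((i, i, "self_loop"))

    # Temporal edges connect consecutive frames.
    for i in range(n - 1):
        edges.append((i, i + 1, "temporal"))

    # Group frame indices by label, then add a capped number of
    # long-range links per node so the graph stays sparse.
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    for indices in by_label.values():
        for pos, i in enumerate(indices):
            added = 0
            for j in indices[pos + 1:]:
                if j - i > 1:  # adjacent frames already share a temporal edge
                    edges.append((i, j, "semantic"))
                    added += 1
                    if added == max_semantic_per_node:
                        break
    return edges


frame_labels = ["take", "take", "pour", "pour", "take"]
for edge in build_frame_graph(frame_labels):
    print(edge)
```

Capping the number of semantic edges per node is one simple way to keep long-range links cheap. It is a design choice of this sketch, not a detail reported in the summary.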

Abstract (original)

Video action segmentation and recognition tasks have been widely applied in many fields. Most previous studies employ large-scale, high computational visual models to understand videos comprehensively. However, few studies directly employ the graph model to reason about the video. The graph model provides the benefits of fewer parameters, low computational cost, a large receptive field, and flexible neighborhood message aggregation. In this paper, we present a graph-based method named Semantic2Graph, to turn the video action segmentation and recognition problem into node classification of graphs. To preserve fine-grained relations in videos, we construct the graph structure of videos at the frame-level and design three types of edges: temporal, semantic, and self-loop. We combine visual, structural, and semantic features as node attributes. Semantic edges are used to model long-term spatio-temporal relations, while the semantic features are the embedding of the label-text based on the textual prompt. A Graph Neural Networks (GNNs) model is used to learn multi-modal feature fusion. Experimental results show that Semantic2Graph achieves improvement on GTEA and 50Salads, compared to the state-of-the-art results. Multiple ablation experiments further confirm the effectiveness of semantic features in improving model performance, and semantic edges enable Semantic2Graph to capture long-term dependencies at a low cost.
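
To make the idea of node attributes and neighborhood message aggregation concrete, here is a minimal sketch in plain Python. It assumes that the three feature types are fused by simple concatenation and that messages are aggregated by an unweighted mean over neighbors. Both are generic choices picked for illustration, not the architecture described in the paper, and the names `fuse_features` and `mean_aggregate` are hypothetical.

```python
def fuse_features(visual, structural, semantic):
    """Concatenate per-node vectors from three modalities.

    ASSUMPTION: concatenation is used here for illustration only; the
    abstract says the features are combined but not how.
    """
    return [v + s + t for v, s, t in zip(visual, structural, semantic)]


def mean_aggregate(features, edges, num_nodes):
    """One round of mean neighborhood aggregation over undirected edges."""
    neighbours = [set() for _ in range(num_nodes)]
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    aggregated = []
    for i in range(num_nodes):
        # Fall back to the node itself if it has no edges at all.
        nbrs = sorted(neighbours[i]) or [i]
        dim = len(features[i])
        aggregated.append(
            [sum(features[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
        )
    return aggregated


# Three frames, each with a 1-d visual, structural, and semantic feature.
visual = [[1.0], [3.0], [5.0]]
structural = [[0.0], [1.0], [2.0]]
semantic = [[1.0], [1.0], [0.0]]

x = fuse_features(visual, structural, semantic)
# Self-loops plus temporal edges between consecutive frames.
edges = [(0, 0), (1, 1), (2, 2), (0, 1), (1, 2)]
h = mean_aggregate(x, edges, num_nodes=3)
print(h[0])  # [2.0, 0.5, 1.0]
```

In a trained GNN, each aggregation step would be followed by a learned linear map and a nonlinearity, and a final layer would output one action label per frame node.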

arXiv information

Authors	Junbin Zhang, Pei-Hsuan Tsai, Meng-Hsun Tsai
Published	2022-11-16 13:26:01+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
