Foundation Model-Driven Framework for Human-Object Interaction Prediction with Segmentation Mask Integration

Summary

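This work introduces the Segmentation to Human-Object Interaction (Seg2HOI) approach, a framework that integrates segmentation-based vision foundation models with the human-object interaction (HOI) task, in contrast to traditional detection-based HOI methods.
Beyond predicting the standard HOI triplets, the approach introduces quadruplets, which extend each triplet with segmentation masks for the human-object pair.
More specifically, Seg2HOI inherits the properties of the vision foundation model (e.g., promptable and interactive mechanisms) and incorporates a decoder that applies these properties to the HOI task.
Although the framework is trained only for HOI, with no additional training mechanisms for these properties, the authors report that the inherited features still operate efficiently.
Extensive experiments on two public benchmark datasets show that Seg2HOI achieves performance comparable to state-of-the-art methods, even in zero-shot scenarios.
Finally, the authors suggest that Seg2HOI can generate HOI quadruplets and interactive HOI segmentation from novel text and visual prompts that were not used during training, and that this flexibility makes it suitable for a wide range of applications.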
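
To make the triplet/quadruplet distinction concrete, the following is a minimal Python sketch of how such outputs might be represented. It is not the authors' implementation: the class names `HOITriplet` and `HOIQuadruplet`, the box and mask formats, and the `pair_mask` helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed formats (not specified in the abstract):
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Mask = List[List[bool]]                  # H x W binary mask


@dataclass
class HOITriplet:
    """Detection-style HOI output: a human, an object, and their interaction."""
    human_box: Box
    object_box: Box
    object_label: str
    interaction: str


@dataclass
class HOIQuadruplet(HOITriplet):
    """A triplet extended with segmentation masks for the human-object pair."""
    human_mask: Mask
    object_mask: Mask

    def pair_mask(self) -> Mask:
        """Pixel-wise union of the two masks (illustrative helper only)."""
        return [
            [h or o for h, o in zip(h_row, o_row)]
            for h_row, o_row in zip(self.human_mask, self.object_mask)
        ]


# Toy 2 x 3 example with made-up values.
quad = HOIQuadruplet(
    human_box=(0.0, 0.0, 2.0, 2.0),
    object_box=(1.0, 0.0, 3.0, 2.0),
    object_label="bicycle",
    interaction="ride",
    human_mask=[[True, True, False], [True, False, False]],
    object_mask=[[False, True, True], [False, False, True]],
)
print(quad.pair_mask())  # [[True, True, True], [True, False, True]]
```

Because the quadruplet subclasses the triplet in this sketch, any code that consumes triplets would also accept quadruplets and simply ignore the masks.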

Abstract (original)

In this work, we introduce Segmentation to Human-Object Interaction (\textit{\textbf{Seg2HOI}}) approach, a novel framework that integrates segmentation-based vision foundation models with the human-object interaction task, distinguished from traditional detection-based Human-Object Interaction (HOI) methods. Our approach enhances HOI detection by not only predicting the standard triplets but also introducing quadruplets, which extend HOI triplets by including segmentation masks for human-object pairs. More specifically, Seg2HOI inherits the properties of the vision foundation model (e.g., promptable and interactive mechanisms) and incorporates a decoder that applies these attributes to HOI task. Despite training only for HOI, without additional training mechanisms for these properties, the framework demonstrates that such features still operate efficiently. Extensive experiments on two public benchmark datasets demonstrate that Seg2HOI achieves performance comparable to state-of-the-art methods, even in zero-shot scenarios. Lastly, we propose that Seg2HOI can generate HOI quadruplets and interactive HOI segmentation from novel text and visual prompts that were not used during training, making it versatile for a wide range of applications by leveraging this flexibility.

arXiv information

Authors	Juhan Park, Kyungjae Lee, Hyung Jin Chang, Jungchan Cho
Published	2025-04-28 14:45:26+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
