HRPVT: High-Resolution Pyramid Vision Transformer for medium and small-scale human pose estimation

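Summary

Human pose estimation at medium and small scales has long been a significant challenge in this field.
Most existing methods focus on restoring high-resolution feature maps, either by stacking multiple costly deconvolution layers or by continuously aggregating semantic information from low-resolution feature maps while maintaining high-resolution ones, which can introduce information redundancy.
In addition, because of quantization error, heatmap-based methods are at a disadvantage when precisely localizing the keypoints of medium- and small-scale people.
This paper proposes HRPVT, which uses PVT v2 as the backbone to model long-range dependencies.
Building on this backbone, it introduces the High-Resolution Pyramid Module (HRPM), designed to produce higher-quality high-resolution representations by incorporating the intrinsic inductive biases of convolutional neural networks (CNNs) into the high-resolution feature maps.
Integrating HRPM improves the performance of pure transformer-based models on medium- and small-scale human pose estimation.
The heatmap-based method is also replaced with the SimCC approach, which removes the need for costly upsampling layers and frees more computational resources for HRPM.
To accommodate models with different parameter scales, the authors developed two insertion strategies for HRPM, each designed to strengthen the model's ability to perceive medium- and small-scale human poses from a different perspective.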
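
The quantization error mentioned in the summary can be illustrated with a small sketch. It is not taken from the paper: the stride values, the cell-centre decoding rule, and the person heights are assumptions chosen for illustration. It round-trips an x coordinate through a heatmap whose cells are `stride` input pixels wide and reports the worst-case localization error, which matters proportionally more for a small person than for a large one.

```python
def decode_heatmap_x(x_true: float, stride: int) -> float:
    """Round-trip an x coordinate through a heatmap with `stride`-pixel cells.

    Assumption: the peak lands in the cell containing x_true and is decoded
    to that cell's centre, expressed in input-image pixels.
    """
    cell = int(x_true // stride)      # index of the heatmap cell containing x_true
    return (cell + 0.5) * stride      # cell centre, back in input pixels


def worst_case_error(stride: int, width: int = 64, samples_per_pixel: int = 100) -> float:
    """Largest absolute decoding error over a dense sweep of x in [0, width)."""
    worst = 0.0
    for i in range(width * samples_per_pixel):
        x = i / samples_per_pixel
        worst = max(worst, abs(decode_heatmap_x(x, stride) - x))
    return worst


if __name__ == "__main__":
    # Hypothetical person heights in input pixels, for illustration only.
    person_heights = (32, 256)
    for stride in (1, 4, 8):
        err = worst_case_error(stride)
        relative = ", ".join(
            f"{100 * err / h:.2f}% of a {h}px person" for h in person_heights
        )
        print(f"stride {stride}: worst-case error {err:.2f}px ({relative})")
```

The same absolute error is a much larger fraction of a small person's extent, which is one way to read the claim that heatmap-based methods struggle at medium and small scales.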

Abstract (original)

Human pose estimation on medium and small scales has long been a significant challenge in this field. Most existing methods focus on restoring high-resolution feature maps by stacking multiple costly deconvolutional layers or by continuously aggregating semantic information from low-resolution feature maps while maintaining high-resolution ones, which can lead to information redundancy. Additionally, due to quantization errors, heatmap-based methods have certain disadvantages in accurately locating keypoints of medium and small-scale human figures. In this paper, we propose HRPVT, which utilizes PVT v2 as the backbone to model long-range dependencies. Building on this, we introduce the High-Resolution Pyramid Module (HRPM), designed to generate higher quality high-resolution representations by incorporating the intrinsic inductive biases of Convolutional Neural Networks (CNNs) into the high-resolution feature maps. The integration of HRPM enhances the performance of pure transformer-based models for human pose estimation at medium and small scales. Furthermore, we replace the heatmap-based method with the SimCC approach, which eliminates the need for costly upsampling layers, thereby allowing us to allocate more computational resources to HRPM. To accommodate models with varying parameter scales, we have developed two insertion strategies of HRPM, each designed to enhance the model’s ability to perceive medium and small-scale human poses from two distinct perspectives.
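
As a rough illustration of why a SimCC-style head needs no upsampling layers, the sketch below treats each keypoint coordinate as two independent 1-D classification problems, one over x bins and one over y bins. It is a minimal sketch, not the paper's implementation: the function names, the splitting factor `k` (bins per pixel), the rounding rule, and the one-hot score vectors standing in for network outputs are all assumptions made here.

```python
from typing import List, Tuple


def encode_simcc(x: float, y: float, k: int, width: int, height: int) -> Tuple[int, int]:
    """Map a continuous keypoint to one class index per axis (k bins per pixel)."""
    x_bin = min(max(round(x * k), 0), width * k - 1)
    y_bin = min(max(round(y * k), 0), height * k - 1)
    return x_bin, y_bin


def one_hot_scores(index: int, length: int) -> List[float]:
    """Hypothetical stand-in for a network output: a 1-D score vector peaking at `index`."""
    return [1.0 if i == index else 0.0 for i in range(length)]


def argmax(scores: List[float]) -> int:
    """Index of the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)


def decode_simcc(x_scores: List[float], y_scores: List[float], k: int) -> Tuple[float, float]:
    """Recover pixel coordinates from the two 1-D score vectors."""
    return argmax(x_scores) / k, argmax(y_scores) / k


if __name__ == "__main__":
    W, H, K = 48, 64, 2                      # illustrative input size and splitting factor
    x_bin, y_bin = encode_simcc(13.3, 40.6, K, W, H)
    x_scores = one_hot_scores(x_bin, W * K)  # length W*K, not a W x H map
    y_scores = one_hot_scores(y_bin, H * K)  # length H*K
    print(decode_simcc(x_scores, y_scores, K))
```

Because the head predicts two vectors of length `W * k` and `H * k` per keypoint instead of a dense 2-D map, sub-pixel resolution comes from the number of bins rather than from upsampled feature maps.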

arXiv information

Author	Zhoujie Xu
Published	2024-10-29 14:36:05+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
