AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning

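Summary

In recent years, advances in representation learning and language models have taken automated captioning (AC) to new heights, making it possible to generate human-level descriptions.
Building on these advances, we propose AVCap, an audio-visual captioning framework that serves as a simple yet powerful baseline for audio-visual captioning.
AVCap uses audio-visual features as text tokens, which brings advantages not only in performance but also in the extensibility and scalability of the model.
AVCap is designed around three key aspects: exploring optimal audio-visual encoder architectures, adapting pre-trained models according to the characteristics of the generated text, and investigating the effectiveness of modality fusion in captioning.
Our method outperforms existing audio-visual captioning methods on all metrics, and the code is available at https://github.com/JongSuk1/AVCap.

Summary (original)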

In recent years, advancements in representation learning and language models have propelled Automated Captioning (AC) to new heights, enabling the generation of human-level descriptions. Leveraging these advancements, we propose AVCap, an Audio-Visual Captioning framework, a simple yet powerful baseline approach applicable to audio-visual captioning. AVCap utilizes audio-visual features as text tokens, which has many advantages not only in performance but also in the extensibility and scalability of the model. AVCap is designed around three pivotal dimensions: the exploration of optimal audio-visual encoder architectures, the adaptation of pre-trained models according to the characteristics of generated text, and the investigation into the efficacy of modality fusion in captioning. Our method outperforms existing audio-visual captioning methods across all metrics and the code is available on https://github.com/JongSuk1/AVCap
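The abstract states that AVCap uses audio-visual features as text tokens but does not spell out the mechanism. The following is only a minimal sketch of one common reading of that phrase: encoder features are linearly projected to the text embedding width and placed in front of the caption token embeddings, so a text decoder sees one sequence. The function names (`project`, `build_decoder_input`), the projection matrix, and all shapes are hypothetical and are not taken from the AVCap code.

```python
def project(features, weight):
    """Map each feature vector (length d_in) to length d_model.

    `weight` is a d_in x d_model matrix given as a list of rows.
    This linear projection is an assumption made for illustration.
    """
    columns = list(zip(*weight))  # d_model columns, each of length d_in
    return [
        [sum(f * w for f, w in zip(vec, col)) for col in columns]
        for vec in features
    ]


def build_decoder_input(av_features, text_embeddings, weight):
    """Prepend projected audio-visual features to the text token embeddings.

    After projection the audio-visual vectors have the same width as the
    text embeddings, so they can be handled like ordinary tokens.
    """
    av_tokens = project(av_features, weight)
    return av_tokens + text_embeddings


# Toy example: 2 audio-visual feature vectors (d_in = 3), 1 text token (d_model = 2).
av_features = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embeddings = [[0.5, 0.5]]

sequence = build_decoder_input(av_features, text_embeddings, weight)
print(len(sequence), len(sequence[0]))
```

In this toy setup the decoder input has three positions of width two: two derived from audio-visual features followed by one text token. How AVCap actually encodes, projects, and fuses the modalities is described in the paper and the linked repository, not here.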

arXiv information

Authors	Jongsuk Kim, Jiwon Shin, Junmo Kim
Published	2024-07-11 02:38:14+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
