Zero-Shot Character Identification and Speaker Prediction in Comics via Iterative Multimodal Fusion

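Summary

Recognizing characters and predicting who speaks each line of dialogue are essential for comic processing tasks such as voice generation and translation.
However, because characters differ from one comic title to the next, supervised approaches such as training character classifiers, which require annotations specific to each title, are infeasible.
This motivates a novel zero-shot approach that lets machines identify characters and predict speaker names from unannotated comic images alone.
Despite their importance in real-world applications, these tasks remain largely unexplored because of the challenges of story comprehension and multimodal integration.
Recent large language models (LLMs) show strong text understanding and reasoning abilities, but applying them to multimodal content analysis is still an open problem.
To address this problem, we propose an iterative multimodal framework, the first to use multimodal information for both character identification and speaker prediction.
Our experiments demonstrate the effectiveness of the proposed framework and establish a robust baseline for these tasks.
Furthermore, because our method requires no training data or annotations, it can be used as-is on any comic series.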

Abstract (original)

Recognizing characters and predicting speakers of dialogue are critical for comic processing tasks, such as voice generation or translation. However, because characters vary by comic title, supervised learning approaches like training character classifiers, which require specific annotations for each comic title, are infeasible. This motivates us to propose a novel zero-shot approach, allowing machines to identify characters and predict speaker names based solely on unannotated comic images. In spite of their importance in real-world applications, these tasks have largely remained unexplored due to challenges in story comprehension and multimodal integration. Recent large language models (LLMs) have shown great capability for text understanding and reasoning, while their application to multimodal content analysis is still an open problem. To address this problem, we propose an iterative multimodal framework, the first to employ multimodal information for both character identification and speaker prediction tasks. Our experiments demonstrate the effectiveness of the proposed framework, establishing a robust baseline for these tasks. Furthermore, since our method requires no training data or annotations, it can be used as-is on any comic series.

arXiv information

Authors	Yingxuan Li, Ryota Hinami, Kiyoharu Aizawa, Yusuke Matsui
Published	2024-08-27 15:56:33+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
