FoodMLLM-JP: Leveraging Multimodal Large Language Models for Japanese Recipe Generation

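Summary

Research on food image understanding using recipe data has long attracted attention because of the diversity and complexity of the data. Because food is inseparable from people's daily lives, it is also an important research area for practical applications such as dietary management. Recent multimodal large language models (MLLMs) have shown remarkable capabilities, both in the breadth of their knowledge and in their ability to handle language naturally. Although they are used mainly with English, they also support multiple languages, including Japanese. MLLMs are therefore expected to substantially improve performance on food image understanding tasks. We fine-tuned the open MLLMs LLaVA-1.5 and Phi-3 Vision on a Japanese recipe dataset and benchmarked them against the closed model GPT-4o. We then evaluated the content of the generated recipes, including ingredients and cooking procedures, on 5,000 evaluation samples that comprehensively cover Japanese food culture. The results show that open models trained on recipe data outperform GPT-4o, the current state-of-the-art model, in ingredient generation: the fine-tuned model achieved an F1 score of 0.531 against 0.481 for GPT-4o. It also performed comparably to GPT-4o in generating cooking procedure text.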

Summary (original)

Research on food image understanding using recipe data has been a long-standing focus due to the diversity and complexity of the data. Moreover, food is inextricably linked to people’s lives, making it a vital research area for practical applications such as dietary management. Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities, not only in their vast knowledge but also in their ability to handle languages naturally. While English is predominantly used, they can also support multiple languages, including Japanese. This suggests that MLLMs can be expected to significantly improve performance in food image understanding tasks. We fine-tuned the open MLLMs LLaVA-1.5 and Phi-3 Vision on a Japanese recipe dataset and benchmarked their performance against the closed model GPT-4o. We then evaluated the content of generated recipes, including ingredients and cooking procedures, using 5,000 evaluation samples that comprehensively cover Japanese food culture. Our evaluation demonstrates that the open models trained on recipe data outperform GPT-4o, the current state-of-the-art model, in ingredient generation. Our model achieved an F1 score of 0.531, surpassing GPT-4o’s F1 score of 0.481. Furthermore, our model exhibited comparable performance to GPT-4o in generating cooking procedure text.
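The abstract reports ingredient generation quality as an F1 score but does not say how predicted and reference ingredients are matched. As a rough illustration only, the sketch below assumes exact string matching on sets of ingredient names and a simple average over samples; the function names `ingredient_f1` and `mean_f1` are hypothetical and are not taken from the paper.

```python
def ingredient_f1(predicted, reference):
    """Set-based F1 between predicted and reference ingredient names.

    Assumption: two ingredients match only if their stripped strings are
    identical. The paper may normalize or match names differently.
    """
    pred = {p.strip() for p in predicted if p.strip()}
    ref = {r.strip() for r in reference if r.strip()}
    if not pred or not ref:
        return 0.0
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)  # share of predictions that are correct
    recall = true_positives / len(ref)      # share of reference items recovered
    return 2 * precision * recall / (precision + recall)


def mean_f1(pairs):
    """Average the per-sample F1 over (predicted, reference) pairs."""
    scores = [ingredient_f1(p, r) for p, r in pairs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    predicted = ["egg", "sugar", "milk"]
    reference = ["egg", "sugar", "butter", "flour"]
    # Two of three predictions are correct; two of four references are found.
    print(round(ingredient_f1(predicted, reference), 3))
```

Under this assumption, a model that lists extra ingredients loses precision and one that omits ingredients loses recall, so the F1 score penalizes both kinds of error.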

arXiv information

Authors	Yuki Imajuku, Yoko Yamakata, Kiyoharu Aizawa
Published	2025-03-03 15:04:18+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
