Are Vision-Language Models Ready for Dietary Assessment? Exploring the Next Frontier in AI-Powered Food Image Recognition

Summary

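Automatic dietary assessment based on food images remains a challenge, requiring accurate food detection, segmentation, and classification.
Vision-Language Models (VLMs) offer new possibilities by integrating visual and textual reasoning.
This study evaluates six state-of-the-art VLMs (ChatGPT, Gemini, Claude, Moondream, DeepSeek, and LLaVA) and analyzes their food recognition capabilities at different levels.
For the experimental framework, we introduce FoodNExTDB, a unique food image database containing 9,263 expert-labeled images across 10 categories (e.g., "protein source"), 62 subcategories (e.g., "poultry"), and 9 cooking styles (e.g., "grilled").
In total, FoodNExTDB includes 50k nutritional labels generated by seven experts who manually annotated every image in the database.
We also propose a new evaluation metric, Expert-Weighted Recall (EWR), that accounts for inter-annotator variability.
The results show that closed-source models outperform open-source ones, achieving over 90% EWR when recognizing food products in images that contain a single product.
Despite their potential, current VLMs face challenges in fine-grained food recognition, particularly in distinguishing subtle differences in cooking styles and visually similar food items.
The FoodNExTDB database is publicly available at https://github.com/AI4Food/FoodNExtDB.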

Summary (original)

Automatic dietary assessment based on food images remains a challenge, requiring precise food detection, segmentation, and classification. Vision-Language Models (VLMs) offer new possibilities by integrating visual and textual reasoning. In this study, we evaluate six state-of-the-art VLMs (ChatGPT, Gemini, Claude, Moondream, DeepSeek, and LLaVA), analyzing their capabilities in food recognition at different levels. For the experimental framework, we introduce the FoodNExTDB, a unique food image database that contains 9,263 expert-labeled images across 10 categories (e.g., ‘protein source’), 62 subcategories (e.g., ‘poultry’), and 9 cooking styles (e.g., ‘grilled’). In total, FoodNExTDB includes 50k nutritional labels generated by seven experts who manually annotated all images in the database. Also, we propose a novel evaluation metric, Expert-Weighted Recall (EWR), that accounts for the inter-annotator variability. Results show that closed-source models outperform open-source ones, achieving over 90% EWR in recognizing food products in images containing a single product. Despite their potential, current VLMs face challenges in fine-grained food recognition, particularly in distinguishing subtle differences in cooking styles and visually similar food items, which limits their reliability for automatic dietary assessment. The FoodNExTDB database is publicly available at https://github.com/AI4Food/FoodNExtDB.
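
The abstract does not give the formula for Expert-Weighted Recall, so the following is only a rough sketch of one way a recall score could account for inter-annotator variability. It assumes each label is weighted by the fraction of experts who assigned it, and that the score is the weighted share of expert labels that the model recovered. The function name `expert_weighted_recall` and this weighting scheme are illustrative assumptions, not the authors' definition.

```python
from collections import Counter


def expert_weighted_recall(expert_labels, predicted):
    """Hypothetical agreement-weighted recall (NOT the paper's exact EWR).

    expert_labels: list of sets, one set of labels per expert annotator.
    predicted: set of labels produced by the model for the same image.
    """
    n_experts = len(expert_labels)
    if n_experts == 0:
        raise ValueError("at least one expert annotation is required")

    # Count how many experts assigned each label (at most once per expert).
    counts = Counter(label for labels in expert_labels for label in set(labels))

    # Assumed weighting: a label's weight is the fraction of experts who chose it.
    weights = {label: count / n_experts for label, count in counts.items()}

    total = sum(weights.values())
    if total == 0:
        return 0.0  # no expert assigned any label

    # Sum the weights of the expert labels that the model also predicted.
    recovered = sum(w for label, w in weights.items() if label in predicted)
    return recovered / total


# Three experts annotate one image; only one of them also sees vegetables.
experts = [{"poultry"}, {"poultry"}, {"poultry", "vegetables"}]
print(expert_weighted_recall(experts, {"poultry"}))     # 0.75
print(expert_weighted_recall(experts, {"vegetables"}))  # 0.25
```

Under this assumed weighting, missing a label that only one expert chose costs less than missing a label on which all experts agree.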

arXiv information

Authors	Sergio Romero-Tapiador, Ruben Tolosana, Blanca Lacruz-Pleguezuelos, Laura Judith Marcos Zambrano, Guadalupe X. Bazán, Isabel Espinosa-Salinas, Julian Fierrez, Javier Ortega-Garcia, Enrique Carrillo de Santa Pau, Aythami Morales
Published	2025-04-09 14:33:59+00:00

Source, services used

arxiv.jp, Google
