Characterizing Model Collapse in Large Language Models Using Semantic Networks and Next-Token Probability




As synthetic content increasingly infiltrates the web, generative AI models may undergo an autophagy process in which they are fine-tuned on their own outputs. This autophagy could lead to model collapse, a degradation in the performance and diversity of generative AI models over successive generations. Recent studies have explored the emergence of model collapse across various generative AI models and types of data. However, current characterizations of model collapse tend to be simplistic and lack comprehensive evaluation. In this article, we conduct a thorough investigation of model collapse across three text datasets, using semantic networks to analyze text repetitiveness and diversity and next-token probabilities to quantify the loss of diversity. We also examine how the proportion of synthetic tokens affects the severity of model collapse and perform cross-dataset evaluations to identify domain-specific variations. By proposing metrics and strategies for a more detailed assessment of model collapse, our study provides new insights for the development of robust generative AI systems.


Authors: Daniele Gambetta, Gizem Gezici, Fosca Giannotti, Dino Pedreschi, Alistair Knott, Luca Pappalardo
Published: 2025-02-02 22:40:09+00:00
Categories: cs.AI, cs.CL