Research published in June predicted that AI companies will exhaust the supply of publicly available human-generated text between 2026 and 2032, a critical inflection point for development approaches that depend on scaling up training data.
"Our findings indicate that current LLM development trends cannot be sustained through conventional data scaling alone," the research paper states, highlighting the need for alternative approaches to model improvement, including synthetic data generation, transfer learning from data-rich domains, and the use of non-public data.