Meta likewise bumped against the limits of available high-quality training data, and in recordings the Times heard, its AI team discussed its unpermitted use of copyrighted works as it worked to catch up to OpenAI. The company, after going through "almost every available English-language book, essay, poem and news article on the internet," apparently considered steps like paying for book licenses or even buying a large publisher outright. It was also apparently limited in how it could use consumer data by privacy-focused changes it made in the wake of the Cambridge Analytica scandal.
Google, OpenAI, and the broader AI industry are wrestling with quickly evaporating stores of training data for their models, which get better the more data they absorb. The Journal reported this week that companies' demand may outpace the supply of new content by 2028.