We have now reached the point where some companies are starting to train their models on synthetic data generated by other algorithms. This is a risky practice, because errors can become entrenched and compound across successive rounds of training and inference; but because it promises an effectively unlimited supply of data and will be difficult to regulate, it’s an attractive proposition for some players.
Algorithms generating data to train other algorithms? We’re approaching Christopher Nolan territory here. In the meantime, companies working on generative algorithms will continue to cut deals with newspapers and any other organization capable of supplying data. Machine learning used to mean working carefully with data: securing access to archives, removing unjustified outliers, and building efficient models from what remained; now we are in a full-on phase in which the only thing that matters is that the resulting algorithm seems to be of reasonable quality, without asking too many questions.