
RE: LeoThread 2024-09-01 09:25


Getting data is now such a priority that pretty much anything goes, as The New York Times explains in “Four Takeaways on the Race to Amass Data for A.I.” The article includes a visual breakdown of the data used to train GPT-3, illustrating the scale of what has been scraped from across the internet since 2007 using web crawlers: roughly 410 billion tokens, compared with the 3 billion tokens that the whole of Wikipedia represents. Then there is book scanning, a pair of collections of 12 billion and 55 billion tokens about which the company gives very little detail, but which are presumed to be millions of published books; and the 19 billion tokens obtained from Reddit by selecting posts that received three or more upvotes as an indicator of quality.
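To put those figures side by side, here is a minimal sketch (in Python) that just compares the token counts cited above; the corpus labels are my own shorthand, not official dataset names, and the percentages only describe this subset of sources, not the full training mix.

```python
# Back-of-the-envelope comparison of the corpus sizes cited in the paragraph above.
# Token counts (in billions) are the figures mentioned there; labels are placeholders.
corpora_billion_tokens = {
    "Web crawl (since 2007)": 410,
    "Books collection 1": 12,
    "Books collection 2": 55,
    "Reddit-curated text (3+ upvotes)": 19,
    "Wikipedia (entire)": 3,
}

total = sum(corpora_billion_tokens.values())
for name, tokens in sorted(corpora_billion_tokens.items(), key=lambda kv: -kv[1]):
    share = tokens / total
    print(f"{name:<35} {tokens:>4}B tokens  ({share:.1%} of this subset)")
```

Run as-is, this shows how dominant the crawled web data is: the 410 billion crawled tokens dwarf everything else, while all of Wikipedia accounts for well under 1% of the listed material.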