A bit of context: @thecrazygmx recently created a webapp (probably) that shows Hive posts, then noticed a sudden burst of activity. @alonicus suggested that it may be AI data scrapers -> link
My guess was exactly the same, even before reading @alonicus's comment :) Networking isn't cheap and content tends to disappear from the network. AI training companies won't be training their LLMs online, straight off live websites. That would be inconsistent, not repeatable, and just .. so goddamn slow.
You can get content from online websites. When a new unaffiliated website shows up, apparently with a ton of new unindexed content, my bet is that the first thing they do is copy it whole: whatever usable content they can fetch automatically.
Then it probably gets indexed, categorized/annotated, and deduplicated, because there's no gain in processing yet another StackOverflow shadow, right? And then whatever's left is added to the internal datastore, the big bag of everything.
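To make the deduplication step a bit more concrete, here's a minimal sketch of how exact-copy filtering could work. This is purely my assumption about the general technique (hashing normalized text), not anything I know about a real pipeline; the function names `normalize`, `fingerprint` and `deduplicate` are made up for illustration.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase, so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def fingerprint(text: str) -> str:
    """SHA-256 of the normalized text, used as the dedup key (an assumption)."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def deduplicate(documents):
    """Keep the first occurrence of each document, drop later exact copies."""
    seen = set()
    unique = []
    for doc in documents:
        key = fingerprint(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    scraped = ["How do I exit vim?", "how do I  exit vim?", "Something new"]
    print(len(deduplicate(scraped)), "unique documents kept")
```

This only catches copies that are identical after normalization. Catching near-duplicates (the same answer with a different footer, for example) would need some fuzzier similarity measure, which I'm not sketching here.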
Once they have it 'locally', they can repeatably reuse it as many times as they want, limited only by their own 'local' storage speed. No network issues. No sudden 404/405/429s. No surprises.
But what's surprising is the number of unique IPs.. well, >150k is a lot. thecrazygm only said "a short time frame", which can vary, but I trust that's reasonably "short", not a few years or something :)
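Since "a short time frame" matters a lot for judging a number like that, here's a rough sketch of how a site owner could count unique IPs inside a chosen window. It assumes the requests are already parsed into `(timestamp, ip)` pairs; the function name `unique_ips_in_window` and the sample data are hypothetical, and I have no idea how the original numbers were actually gathered.

```python
from datetime import datetime, timedelta


def unique_ips_in_window(requests, start, window):
    """Count distinct IPs among requests with start <= timestamp < start + window."""
    end = start + window
    return len({ip for ts, ip in requests if start <= ts < end})


if __name__ == "__main__":
    # Made-up sample data, using documentation-reserved IP ranges.
    t0 = datetime(2024, 1, 1, 12, 0, 0)
    requests = [
        (t0, "203.0.113.1"),
        (t0 + timedelta(minutes=5), "203.0.113.2"),
        (t0 + timedelta(minutes=10), "203.0.113.1"),
        (t0 + timedelta(hours=3), "198.51.100.7"),
    ]
    print(unique_ips_in_window(requests, t0, timedelta(hours=1)))
```

Running the same count over an hour, a day and a week would show quickly whether the traffic is a steady trickle or a burst.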
Aside from rotating and randomizing the IPs to avoid simple detection, I can't think of any reason why they would do it, but that's reason enough, I guess, considering the current anti-AI wave in writers'/artists'/etc. circles. Same for the use of VPNs. Got locked out on one IP pool for this site? OK, slow down a bit more, switch to a VPN provider not used yet, and go on, just slower. That scraping and evasion strategy could even be automated, right?
I wouldn't be surprised if data collection was already outsourced a long time ago. It's like gathering credit card or PII data: financial criminals don't do it themselves, they just buy fresh databases. So for a company or individual trying to make money by hunting for new data for AI trainers, it seems just too good an idea not to do it that way, or a similar way.
That's my guess. I haven't worked for any of those, so I can't tell from my own experience. But I simply can't imagine them doing it much differently.