One of the first things I wanted to do when I got here was to use AI to classify spam/ham and other things. I found this was extremely difficult due to the lack of quality signals in almost every action. For example, votes, follows, and reblogs all cannot be trusted, as they are heavily financially motivated and are not organic. Content uses many languages and is frequently spun/plagiarized (more so on Steem, less on Hive nowadays, but it still happens).
I think it is far more viable on Hive, but still difficult.
I am curious what specific questions you are looking to answer with AI. My initial goal was to automatically detect spam, but it just wasn't practical. I thought of a few other things like a recommendation engine, a rep algorithm replacement, and so on, but due to the low-quality metrics, it just wasn't viable.
Right now we use upvotes for a start, but then transition to site usage data over time.
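As a rough illustration of that kind of hand-off (a sketch only, not the actual implementation; the function name `blended_score`, the linear weighting, and the `ramp` parameter are all assumptions), the ranking score could lean on upvotes at first and shift toward on-site usage as more usage events are collected for a user:

```python
def blended_score(upvote_score: float, usage_score: float,
                  usage_events: int, ramp: int = 50) -> float:
    """Blend an on-chain upvote signal with an on-site usage signal.

    All names and the linear ramp are hypothetical, for illustration only.
    """
    # Weight on usage data grows linearly from 0 to 1 as events accumulate,
    # and is clamped so it never leaves the [0, 1] range.
    w = min(max(usage_events, 0) / ramp, 1.0)
    # With no usage data the score is purely upvote-based; after `ramp`
    # events it is purely usage-based.
    return (1.0 - w) * upvote_score + w * usage_score
```

A brand-new user would get `blended_score(u, s, 0) == u`, while a user with `ramp` or more recorded events would get `s`, so the on-chain signal fades out on its own.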
How useful on-chain metrics are depends on the individual user; my feed looks a lot like a daily curangel compilation, for example.
For the finished product I want the financials to disappear into the background anyways, at least initially. It's not targeting Hive power users, but regular social media folks.
Upvotes are not a good signal as many are done blindly and very few votes are actually organic. It basically becomes a random number generator.
I obviously am aware of how people vote. And again, it depends on the user. There are people who vote organically, the others are not our targeted demographic.
The end goal is a site that's used like traditional social media to attract normal people who don't care about optimizing for 3 cents.
That's going to be a small sample size, and it won't generalize.
You're generalizing using the current userbase, which is not the intention here. Hive clearly failed to attract the mainstream, and the main reason is the focus on rewards. People want to be entertained on social media, not start a new job. Content discovery is key there.
If it is for content discovery, the data points you are looking at using (votes from organic manual voters) would mean the content has already been discovered.
If you don't get it, I don't have the time to explain. Keep doing what you're doing and see where that brings us.