Sincerity Project Update
Over the last week I've been integrating some of the data from the SteemPlus crowdsourcing process to expand the amount of training data I can use to teach the machine learning algorithm how to distinguish spammers and bots from human content creators.
I'm fairly new to machine learning, and when somebody (sorry I can't remember who) mentioned random forest classifiers a couple of weeks ago, I decided to investigate them more. Following that, I have also now changed part of the classifier from using a nearest neighbours algorithm to a random forest classifier.
In terms of how well the software predicts the training data (cross-validation), these factors have resulted in a slight improvement in accuracy. The 'false positives' have also been reduced, meaning that fewer non-spammers should be wrongly labelled as spammers in SteemPlus and other services using the Sincerity spam API. Obviously people have different ideas of what a spammer is though, so this tool can at best reflect some kind of community average.
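To make the accuracy and 'false positive' figures concrete, here is a minimal sketch of how both can be computed from held-out predictions. This is not the Sincerity classifier itself. The function names and the six example labels are made up for illustration, and 'bot' is simply treated as 'not spammer' here.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels, positive="spammer"):
    """Count true/false positives and negatives for one class versus the rest."""
    counts = Counter()
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def accuracy(counts):
    """Share of accounts whose spammer/not-spammer prediction was right."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0


def false_positive_rate(counts):
    """Share of non-spammers that were wrongly labelled as spammers."""
    negatives = counts["fp"] + counts["tn"]
    return counts["fp"] / negatives if negatives else 0.0


# Hypothetical held-out labels, not real Sincerity data.
truth = ["human", "human", "spammer", "bot", "spammer", "human"]
predicted = ["human", "spammer", "spammer", "bot", "human", "human"]

counts = confusion_counts(truth, predicted)
print(f"accuracy: {accuracy(counts):.2f}")
print(f"false positive rate: {false_positive_rate(counts):.2f}")
```

A lower false positive rate is the number that matters most for people worried about being wrongly flagged, even if overall accuracy only moves slightly.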
I am also now collecting extra data which will be ready to incorporate at the end of the month, and which seems likely to further improve accuracy.
By that time, I would also like to have collected even more crowdsourced data for adding to the training set. I am very happy with the software, but the training data could still be better.
The trouble with allowing people to anonymously report spammers is that some people seem to use it as a way to try and remove content that they just don't like or understand. For example, many non-English language accounts were labelled incorrectly as spammers, as were some popular YouTubers who have recently joined the platform, perhaps with divisive content.
I am thinking about better ways to collect crowdsourced data; feel free to let me know if you can help, or have any ideas. If you are in a position to delegate some SP for me to upvote comments with useful training data, that could be very useful.
Current Training Data
Whilst I can't reveal full details about the classification algorithm for a couple of reasons, the training data I used is shown here (and will be kept up to date when retraining happens). I don't have time to check all these accounts myself, but if you do, and find any inaccuracies, that is very helpful information for improving the classifier. One incorrectly classified account here could affect the software classification of many accounts!
Pending changes (changes I plan before the next software training)
Thanks to the validation by @fraenk, I've updated the changes list.
To be removed from all lists:
arcange
dailytop10open
jehovahwitness
new-york
altobot
austrobot
dailypick
To relabel as bots:
coin.info
followforupvotes
tts
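As an illustration only, the pending changes above could be applied to a table of labels roughly like this. The dictionary layout, the function name `apply_pending_changes`, the starting labels, and the account `example-user` are all assumptions; the real training data format isn't described in this post.

```python
def apply_pending_changes(labels, to_remove, to_relabel):
    """Return a new label table with removals and relabels applied."""
    # Drop every account that should no longer be used for training.
    updated = {
        account: label
        for account, label in labels.items()
        if account not in to_remove
    }
    # Relabel only accounts that are still present after the removals.
    for account, new_label in to_relabel.items():
        if account in updated:
            updated[account] = new_label
    return updated


# Account names taken from the lists above.
to_remove = {
    "arcange", "dailytop10open", "jehovahwitness", "new-york",
    "altobot", "austrobot", "dailypick",
}
to_relabel = {name: "bot" for name in ("coin.info", "followforupvotes", "tts")}

# Hypothetical starting labels; "example-user" is a made-up account.
labels = {
    "arcange": "human",
    "new-york": "human",
    "coin.info": "human",
    "tts": "human",
    "example-user": "human",
}

updated = apply_pending_changes(labels, to_remove, to_relabel)
print(updated)
```

Returning a new dictionary rather than editing in place keeps the previous labels around, which makes it easy to compare classifier results before and after a retraining.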
New API Methods Coming Soon
Project Sincerity isn't just about spam and bot classification though. More of the large amount of data being collected will soon be available in the form of new APIs which relate to characteristics of voting, commenting, etc.
I am going through some of the "human"-classified accounts to check whether those contain some potentially misleading data-sets. I'll collect the results in comments below:
DISCLAIMER: the following interpretations are only MY subjective opinion, nothing else.
P.S.: this starts looking a bit spammy in and of itself, sorry, didn't expect this to be so many so instantly...
P.P.S.: I also started looking at the "spammer" classified training data, lots of humans and bots in there (imho)... the training data seems to me like it could do with a much more thorough vetting process!
andybets - (&steemreports) has the right classification as human, but to be honest, you should probably remove your own account from the training data and see how your own AI ranks yourself, just to get a first-hand feel for it...
I would remove this account from the training-set to avoid any subjective in-house-biasing
Thanks for this. It's very helpful, I'll review your suggested changes.
I agree with all the changes you suggested, and have adjusted the data sets. :)
Awesome, I am glad I could help out!
As mentioned above, I believe there are also quite a few "false positives" among the training spammers, too... I'll go through some more of those when I find the time.
That'd be great. I have just sent you 2 SBD as a small thanks.
dailytop10open - might be operated by a human, but the pattern looks more like a bot-classification to me, very repetitive content and the comments are primarily "functional"
I would reclassify as bot or just remove it from the training set to avoid ambiguity
jehovahwitness - OK, I might be biased, but looking at their comments close up reveals the same set of a dozen or so "inspirational" comments being repeated over and over. I think it's questionable whether this is actually a human, and it may even be seen as spam by some.
I would remove this from the training data due to its ambiguity
new-york - I think this is without a doubt spam, and probably bot-spam! The same identical "promotion" comment for a "resteem-service" is being posted over and over and over
I would reclassify this as spam
altobot - a self-proclaimed bot posting "manual" reports, probably it should be seen as more bot than human?!
I would remove this from the training data due to its high ambiguity
austrobot - self-proclaimed trailing bot that posts manual content (?)
I would remove this from the training data due to ambiguity
coin.info - definitely a bot, has no original content, just leaves comments notifying of crypto rates of coins mentioned in the original posts
I would reclassify this as a bot
dailypick - curation service, might be manual, could be automated, repetitive comments look very bot-like
I would either reclassify this as a bot or remove from the training data to avoid ambiguity.
followforupvotes - self-proclaimed voting bot, randomly voting for its followers and leaving repetitive comments and posts. No question this is a bot
I would reclassify this as a bot
You have me more than curious! Keep up the great work, your tools are already essential to "screening" suspicious activities.
I commend your progress on making this community more transparent!
This is really a great project. A good classification is so hard to achieve, even for humans.
Thank you so much for removing me from the lists.
I'm stoked to see its progress!
So, Skynet next? :)
@tipu upvote this post with 0.5 sbd
Hey @cardboard, I love smartness's @tipu! ;)
I just joined it: thank you for this innovative service!
It's awesome to see this make progress and improve in accuracy ...
It still leaves me with some worries when I check the current Top-Spammers according to the sincerity API:
While some of these accounts are in fact leaving very repetitive comments that may well be seen as spam... they are certainly lacking the volume to be in the ranks of "top-spammers".
At least that's my subjective interpretation of how to define spam... quantity does play a major role here.
Taking into account that there are accounts like @a-0-0 leaving 27k comments in the same timeframe, I think something should be done on that aspect.
OR, if the API purely wants to classify, maybe it just shouldn't publish a "ranking"?!
It's a fair point. I will also add a list of accounts sorted by the most comments made.
I guess that accounts like this, which already have a negative rep, probably aren't interfering with most people's experiences anymore, so they aren't being reported as spammers by the community. This software is increasingly using a community average of what a spammer is as its classification definition.
Great!
I think that's exactly what we need and I am stoked to see this increasing in "accuracy" of reflecting that.
The classification score for those "top-ranked" spammers does not feel inaccurate to me to be honest, but calling those the top-spammers is taking the result a bit out of context imho.
Is there an irony that your account shows 0.7% spammer and 0.2% bot? ;)
It's gone down to 0.6% spammer now because I've been trying not to spam! ;)
Hi @andybets! You have received 0.5 SBD @tipU upvote from @cardboard !
@tipU! upvotes with 200% profit and pays 100% profit + 50% curation rewards to investors :)
The SadKitten bot needs to be stopped; it only helps the whales, and everyone knows that. When a post has an upvote that shows an amount besides zero, the post will most likely be viewed by someone and maybe even upvoted, and everyone who upvotes gets a slice of the pie. The downvotes hurt not only the person who upvotes their own post but also the others who get a part of it. And when the coins are divided up by Steem, the only ones who get any are the select inner circle, aka the whales, which is the only reason the SadKitten bot is really there. Why not flag all the other bots out there that always upvote the whales? The IPs and VPNs always show that the bots are from select whales, and some whales are now often changing their IP and VPN to cover their butts. Steem was a good idea at first, then came the greed and the bots, and now SadKitten proves it even more: greed and censorship to keep the peasants poor while the rich get richer. Greed is greed and nothing more, and it will be the death of Steem as people flee and will not want Steem.