How to Use AI to Find Articles with Ruby

in #radiator, 8 years ago (edited)

Note: This is pure magic and highly experimental. In a nutshell, we're going to look at the trending page and try to predict which new posts will reach trending. To do this, we're going to use ID3. According to Wikipedia:

In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm, and is typically used in the machine learning and natural language processing domains.

ID3 algorithm

In Ruby, we can use the ID3 algorithm through the ai4r gem.

Ok, it's not really magic. So, how does it work? I have ID3 look at some specific attributes of top 100 trending posts. Specifically:

author_reputation
percent_steem_dollars
promoted
category
net_votes

Based on these attributes, I have it predict the total_pending_payout_value of a new post. If a prediction can be made, the script displays the difference between the prediction and the current pending payout.
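As a rough illustration of that last step, the sketch below parses a payout string and computes the difference from a predicted value. The two helpers mirror base_value and symbol_value from the script further down; the sample payout string and the predicted value of 17 are made up for illustration.

```ruby
# Payout amounts are handled as strings such as "4.250 SBD" (sample value, made up).
def base_value(raw)
  raw.split(' ').first.to_i # whole units only; the fraction is truncated
end

def symbol_value(raw)
  raw.split(' ').last
end

current = '4.250 SBD' # hypothetical pending payout
predicted = 17        # hypothetical ID3 prediction

difference = predicted - base_value(current)
summary = "#{difference} #{symbol_value(current)}"
puts summary
```

Because base_value truncates with to_i, the differences are always whole numbers, which is why the sample output later in the post never shows fractions.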

As always, we use Radiator with bundler. You can get bundler with this command:

$ gem install bundler

I've tested it on various versions of Ruby. The oldest one I got it to work on was:

ruby 2.0.0p645 (2015-04-13 revision 50299) [x86_64-darwin14.4.0]

First, make a project folder:

$ mkdir radiator
$ cd radiator

Create a file named Gemfile containing:

source 'https://rubygems.org'
gem 'radiator', github: 'inertia186/radiator'
gem 'ai4r' # Adds general machine learning capabilities.

Then run the command:

$ bundle install

Create a file named ai-scan.rb containing:

require 'rubygems'
require 'bundler/setup'

Bundler.require

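# Convert a raw reputation value to a display-style score using a
# log10 scale with a floor of 25.
def to_rep(raw)
  raw = raw.to_i
  level = Math.log10(raw.abs)
  level = [level - 9, 0].max
  level = (level * 9) + 25
  level.to_i
end

# Numeric part of an amount string such as "12.345 SBD", truncated to an integer.
def base_value(raw)
  raw.split(' ').first.to_i
end

# Currency symbol of an amount string such as "12.345 SBD".
def symbol_value(raw)
  raw.split(' ').last
end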

api = Radiator::Api.new
data_labels = %w(
  author_reputation percent_steem_dollars promoted category net_votes
  total_pending_payout_value
)
prediction_label = data_labels.last

options = {
  limit: 100
}

options[:tag] = ARGV.first if ARGV.any?

response = api.get_discussions_by_trending(options)
trending_comments = response.result

data_items = trending_comments.map do |comment|
  data_labels.map do |label|
    case label
    when 'author_reputation'; to_rep comment[label]
    when 'promoted'; base_value comment[label]
    when 'total_pending_payout_value'; base_value comment[label]
    else; comment[label]
    end
  end
end

data_set = Ai4r::Data::DataSet.new data_labels: data_labels, data_items: data_items
id3 = Ai4r::Classifiers::ID3.new.build(data_set)

response = api.get_discussions_by_created(options)
new_comments = response.result - trending_comments
 
predictions = new_comments.map do |comment|
  next unless comment.mode == 'first_payout'

  data_item = data_labels.map do |label|
    case label
    when 'author_reputation'; to_rep comment[label]
    when 'promoted'; base_value comment[label]
    when 'total_pending_payout_value'; base_value comment[label]
    else; comment[label]
    end
  end

  prediction = (id3.eval(data_item) rescue nil)

  next if prediction.nil?

  {
    difference: prediction - base_value(comment.total_pending_payout_value),
    symbol: symbol_value(comment.total_pending_payout_value),
    url: "https://steemit.com#{comment.url}"
  }
end.compact

if predictions.any?
  puts "Predicting the following payouts will rise by:"
  predictions.sort_by { |p| p[:difference] }.each do |prediction|
    puts "#{prediction[:difference]} #{prediction[:symbol]}: #{prediction[:url]}"
  end
else
  puts "Nothing to predict."
end

Then run it:

$ ruby ai-scan.rb

The expected output will be something like this:

Predicting the following payouts will rise by:
0 SBD: https://steemit.com/history/@steemizen/today-in-history-uss-arkansas
0 SBD: https://steemit.com/steem/@ozchartart/usdsteem-btc-daily-poloniex-bittrex-technical-analysis-market-report-update-162-jan-14-2017
10 SBD: https://steemit.com/travel/@writingamigo/traveler-s-observations-the-origins-of-habits-how-environement-forces-us-to-believe-that-it-is-our-fault
13 SBD: https://steemit.com/fiction/@johnjgeddes/tempest-and-tea-rediscovering-the-magic-within-part-1-of-2
15 SBD: https://steemit.com/travel/@exploretraveler/photo-of-the-day-skagway-alaska
17 SBD: https://steemit.com/news/@contentjunkie/spacex-launches-first-rocket-since-explosion
17 SBD: https://steemit.com/food/@anti-sophist/bold-lamb-loin-chops-and-basil-potatoes-2017114t195031380z
17 SBD: https://steemit.com/pizzagate/@gizmosia/the-video-the-world-must-watch-chilling-info-re-child-trafficking-posted-today
17 SBD: https://steemit.com/minecraft/@thedonutguy7/how-to-download-a-minecraft-map-for-windows
17 SBD: https://steemit.com/fly/@altcointrader77/flycoin-in-the-hands-of-a-trusted-few
17 SBD: https://steemit.com/fiction/@internutter/challenge-01476-d015-historical-hysterical-first
17 SBD: https://steemit.com/animal/@favorit/nature-that-surrounds-us-in-the-animal-world-black-stallion-23
18 SBD: https://steemit.com/film/@movie-online/confidential-secret-market-1974-romance-history
18 SBD: https://steemit.com/life/@lukestokes/day-6-update-the-wim-hof-method
18 SBD: https://steemit.com/kr/@leesunmoo/6r1hns
19 SBD: https://steemit.com/challenge30/@franks/challenge30-deep-space-mining-unobtainium

You can also pass a tag:

$ ruby ai-scan.rb photography

The expected output will be something like this:

Predicting the following payouts will rise by:
0 SBD: https://steemit.com/travel/@koskl/visiting-cusco-peru
0 SBD: https://steemit.com/nature/@zaskia/beautiful-flower
0 SBD: https://steemit.com/photography/@distantsignal/shooting-milkshake-web-series-on-vintage-russian-lenses
0 SBD: https://steemit.com/photography/@chrissysworld/the-sky-burns-the-angels-flee-der-himmel-brennt-die-engel-fliehn-english-deutsch
0 SBD: https://steemit.com/photography/@klava/white-truffle
0 SBD: https://steemit.com/photography/@rynow/sunken-fish-trailer
0 SBD: https://steemit.com/food/@lonilush/traditional-balkan-cheese-pie-burek-original-recipe-with-pictures
0 SBD: https://steemit.com/nature/@riostarr/mushrooms-on-dead-wood
1 SBD: https://steemit.com/photography/@richar/life-and-death-on-wall-street
1 SBD: https://steemit.com/photography/@xntryk1/swapmeet-finds-640
5 SBD: https://steemit.com/photography/@jasonrussell/jacks-fork-river-10-pictures
5 SBD: https://steemit.com/photography/@kalemandra/reflections
17 SBD: https://steemit.com/photography/@briansss/check-it-out-my-photo-album-of-my-trip-through-venezuela
17 SBD: https://steemit.com/food/@alizee/pecal-tubers-vegetables-papaya-flower

Either way, you can use these results as voting suggestions: each line shows the gap between the payout ID3 predicts for the article and its current pending payout.

Under the hood, here's a rough explanation of what's going on. We take the trending posts, and just extract certain fields as inputs to ID3. The inputs become:

author_reputation  percent_steem_dollars  promoted  category  net_votes  total_pending_payout_value
52                 10000                  0         romance   146        16
58                 10000                  0         story     160        16
67                 0                      0         science   162        16
58                 10000                  0         travel    178        16
60                 10000                  0         gaming    166        16
54                 10000                  0         fiction   141        15
54                 10000                  0         food      163        15
53                 10000                  0         art       167        15
67                 0                      0         japan     108        15
61                 10000                  0         poker     21         15
59                 10000                  0         til       158        15
63                 10000                  0         music     165        15
60                 10000                  0         art       160        15
59                 10000                  0         aceh      155        15
59                 10000                  0         writing   147        15
55                 10000                  0         life      160        15
51                 10000                  0         painting  148        15
57                 0                      1         life      130        15
59                 10000                  0         travel    163        15
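The author_reputation values in these inputs are not raw reputation numbers; the script's to_rep helper compresses them onto a log scale first. Here is a small sketch of that conversion. The helper is copied from ai-scan.rb, and the raw values are invented for illustration.

```ruby
# Same helper as in ai-scan.rb: squash a raw reputation onto a log10 scale.
def to_rep(raw)
  raw = raw.to_i
  level = Math.log10(raw.abs) # -Infinity when raw is 0
  level = [level - 9, 0].max  # anything below 10**9 clamps to 0
  level = (level * 9) + 25    # 25 is the floor score
  level.to_i
end

# Raw values below are invented for illustration.
puts to_rep('0')              # clamps to the floor
puts to_rep('5000000000000')  # about 5 * 10**12
puts to_rep('-5000000000000') # sign is dropped by raw.abs
```

Because of raw.abs, a negative raw reputation produces the same score as a positive one of equal magnitude, so the helper as written does not distinguish the two.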

ID3 takes the above inputs and builds a decision tree from the correlations it finds. Each new post is then run through that tree to predict its final total_pending_payout_value.

For instance, it might notice that authors with a reputation of 59, posting in til, tend to have a total_pending_payout_value of 15. So if a new post matches, it'll make that prediction.

But then, it might also notice a correlation involving percent_steem_dollars and promoted that only holds when the category is science. It's that flexible.
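To make "noticing a correlation" a little more concrete: ID3 picks which attribute to split on by information gain, that is, how much knowing the attribute reduces the uncertainty (entropy) about the outcome. The sketch below is a hand-rolled illustration in plain Ruby, not a description of how ai4r is implemented internally. The method names are hypothetical and the rows are made up in the spirit of the inputs listed above.

```ruby
# Shannon entropy (in bits) of a list of outcome labels.
def entropy(labels)
  total = labels.size.to_f
  labels.group_by { |label| label }.values.reduce(0.0) do |bits, group|
    prob = group.size / total
    bits - prob * Math.log2(prob)
  end
end

# Information gain from splitting rows on the attribute at index attr.
# The last element of each row is treated as the outcome.
def information_gain(rows, attr)
  before = entropy(rows.map(&:last))
  after = rows.group_by { |row| row[attr] }.values.reduce(0.0) do |bits, subset|
    bits + (subset.size.to_f / rows.size) * entropy(subset.map(&:last))
  end
  before - after
end

# Made-up rows: [category, promoted, payout]
rows = [
  ['science', 0, 16],
  ['science', 0, 16],
  ['life',    0, 15],
  ['life',    1, 15]
]

category_gain = information_gain(rows, 0)
promoted_gain = information_gain(rows, 1)

puts "category: #{category_gain}"
puts "promoted: #{promoted_gain}"
```

In this toy data, category separates the payouts perfectly while promoted only helps a little, so a split on category would come first.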

As an analogy, it's a little bit like weather prediction: "In this area, on this day, for the last 100 years, when the temperature is x and the humidity is y, it rains z percent of the time."

You will notice, I specifically exclude the author name from the prediction inputs. If you want to include it, you can add it yourself by modifying data_labels in the script and adding author to the beginning.
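For reference, that change would look roughly like the following. This is an untested sketch: it assumes nothing else in the script needs to change, on the grounds that the case statements pass any unrecognized label straight through as comment[label].

```ruby
# data_labels from ai-scan.rb with 'author' added at the beginning.
# The prediction target stays last, because the script reads it
# with data_labels.last.
data_labels = %w(
  author
  author_reputation percent_steem_dollars promoted category net_votes
  total_pending_payout_value
)
prediction_label = data_labels.last

puts data_labels.inspect
puts prediction_label
```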

While including author might help ID3 make better predictions, personally, I'm not interested in correlating the author name. We already have enough of those kinds of tools (albeit without ID3). I want ID3 to be indifferent to the author and to base its prediction on more subtle inputs, which is what it's designed to do.


See my previous Ruby How To posts in: #radiator #ruby


Interesting.

Cool post. Did you measure the correlation between the predicted payout and the real one?

I'm still looking at it. When I originally posted this post, my script said I would earn $17. Then, 5 minutes later, it couldn't make any more predictions about this post.

The other samples in this post seem to correlate a little better than chance, on cursory analysis. I'll do a more in-depth post later.

Very helpful post! Interesting too.