Parsing RSS with Bash scripts

in #programming, 5 years ago


I know there are plenty of great ways to read RSS, from within the browser to dedicated RSS apps on any platform: GNOME, KDE, OS X, even scripts embedded in your favorite editor, Emacs. Still, there is always some cool new way to read the news from your favorite site.

So what can our handy shell script do to parse this news and show us the latest info from the internet?

We need to remind ourselves RSS is XML

So we can leverage some of the existing XML tools. Another thing to remember is that the schema is well known, so we can look for very specific tags and iterate through them.

The key tags in the RSS tree that we want to pay attention to are the following:

  • item
  • title
  • link
  • description
  • pubDate

The item tag wraps each news entry, and inside it we find the title, pubDate, description and link.
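To have something to experiment with, one option is to write a tiny feed to disk. The sketch below is only an illustration: the feed content is reconstructed from the sample output shown later in this post, and the temporary file stored in the feed variable is an assumption, not something the original script requires.

```shell
#!/bin/bash
# Write a small sample feed to a temporary file (hypothetical location).
feed=$(mktemp)
cat > "$feed" <<'EOF'
<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
<channel>
<title>RSS title</title>
<link>http://www.example.com/</link>
<description>RSS description</description>
<language>en</language>
<item>
<title>News item title</title>
<link>http://www.example.com/link/to/news/item/</link>
<guid isPermaLink="true">identifier-5f4b02697d2006f72648ebd0d9c6ce96</guid>
<description>Full news item text.</description>
<pubDate>Fri, 01 Jul 2016 17:41:07 +0000</pubDate>
</item>
</channel>
</rss>
EOF

# Count the entries, then print only the lines that belong to an item.
item_count=$(grep -c '<item>' "$feed")
echo "items: $item_count"
sed -n '/<item>/,/<\/item>/p' "$feed"
```

The path in $feed can then be passed as the first argument to the scripts in this post.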

One of the main commands we will use in Bash is read. We need to prepare our code to understand the concept of tags: a tag is a name wrapped between the '<' and '>' symbols.

Time for some code

The read command is influenced by the shell variable $IFS. If you look at the manual, it says IFS is the Internal Field Separator, the set of characters read uses to split its input into fields. By setting local IFS='>' we split each chunk right after the tag name, and with read -d '<' we make read stop at the start of the next tag instead of at the end of the line.
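To see what these two settings do together, here is a minimal sketch that feeds one hard-coded line to read. The sample string and the parse_demo function are made up for illustration and are not part of the final script.

```shell
#!/bin/bash
# Hypothetical one-line input, used only for this demonstration.
sample='<title>RSS title</title>'

parse_demo() {
   local IFS='>'
   local TAG CONTENT
   # First read: consumes the (empty) text before the first '<'.
   read -d '<' TAG CONTENT
   # Second read: sees "title>RSS title" and splits it at the '>'.
   read -d '<' TAG CONTENT
   echo "TAG=$TAG CONTENT=$CONTENT"
}

result=$(printf '%s' "$sample" | parse_demo)
echo "$result"
```

Because IFS contains only '>', the space inside "RSS title" is not treated as a separator and the content stays in one piece.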

We can create a function so that we can conceptualize the tag.

identify_tags() {
   local IFS='>'
   read -d '<' TAG CONTENT
} 

With this function, each call reads one tag into TAG and the text that follows it into CONTENT.

Now we loop over the file and see how many tags we can catch, using cat, while and echo.

cat "$1" | while identify_tags ; do
   echo "<$TAG>{$CONTENT}"
done

With this, both the tag and its content are available inside the loop. Running it over a sample feed gives the following output:

<>{}
<?xml version="1.0" encoding="UTF-8" ?>{}
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">{}
<channel>{}
<title>{RSS title}
</title>{}
<link>{http://www.example.com/}
</link>{}
<description>{RSS description}
</description>{}
<language>{en}
</language>{}
<item>{}
<title>{News item title}
</title>{}
<link>{http://www.example.com/link/to/news/item/}
</link>{}
<guid isPermaLink="true">{identifier-5f4b02697d2006f72648ebd0d9c6ce96}
</guid>{}
<description>{Full news item text.}
</description>{}
<pubDate>{Fri, 01 Jul 2016 17:41:07 +0000}
</pubDate>{}
</item>{}
</channel>{}

Each line now shows the tag followed by its value wrapped in braces, something like <description>{RSS description}.

Time to parse

So we have a general tag parser which isolates the content. But we still need to catch the desired tags, so we need a filter, and not one but several, as there are multiple tags. This is where we use case to handle the different scenarios.

We use the case command on our magic $TAG variable to match each of the desired tags. If you haven't used case in Bash, here is a quick description.

Bash case statements are generally used to simplify complex conditionals when you have multiple different choices. Using the case statement instead of nested if statements will help you make your bash scripts more readable and easier to maintain.
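As a quick illustration of the syntax, here is a small sketch. The classify_tag helper and its messages are hypothetical and only show how patterns, the | alternative and the * fallback work.

```shell
#!/bin/bash
# classify_tag is a made-up helper for illustration only.
classify_tag() {
   case $1 in
      'item')
         echo "start of a news item"
         ;;
      '/item')
         echo "end of a news item"
         ;;
      'title'|'link'|'description'|'pubDate')
         # several patterns can share one branch
         echo "field we want to keep"
         ;;
      *)
         # anything else falls through to the default branch
         echo "ignored"
         ;;
   esac
}

classify_tag 'item'
classify_tag 'pubDate'
classify_tag 'guid isPermaLink="true"'
```

Each branch ends with ;; and the first pattern that matches wins, so the * branch only runs when nothing above it matched.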

Note: if you come from C-like languages, case is Bash's counterpart to switch; Bash has no separate switch statement.

With that aside, we go into building our case. The key here is the initial expression, $TAG, which is matched against each of our desired tags.

   case $TAG in
      'item')
         title=''
         link=''
         pubDate=''
         description=''
         ;;

With this filter, whenever $TAG is item we reset the title, link, pubDate and description variables, so every news entry starts with a clean slate.

From there, we add a branch for each tag we care about. Something like this for title:

      'title')
         title="$CONTENT"
         ;;

And so we have the following construct:

cat "$1" | while identify_tags ; do
   case $TAG in
      'item')
         title=''
         link=''
         pubDate=''
         description=''
         ;;
      'title')
         title="$CONTENT"
         ;;
      'link')
         link="$CONTENT"
         ;;
      'pubDate')
         pubDate="$CONTENT"
         ;;
      'description')
         description="$CONTENT"
         ;;
      '/item')
         cat<<EOF

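Pro Tip: notice the << operator and the EOF marker. This is a here-document: everything up to the closing EOF line is fed to cat as its input, with variables such as $title expanded along the way, so we can print a whole template once the filtering is done. Pretty meta, hey!?!

So now we can fuse this with our function and end up with this code: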
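The << construct is a here-document, and a short sketch may make its behavior clearer. The values assigned to title and link below are placeholders chosen for the example.

```shell
#!/bin/bash
# Placeholder values, for illustration only.
title='News item title'
link='http://www.example.com/link/to/news/item/'

# Unquoted delimiter: variables inside the here-document are expanded.
html=$(cat <<EOF
<h3><a href="$link">$title</a></h3>
EOF
)
echo "$html"

# Quoted delimiter: the text is passed through literally.
raw=$(cat <<'EOF'
<h3>$title</h3>
EOF
)
echo "$raw"
```

The closing EOF has to sit alone at the start of its line, which is why the template in the script is not indented.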

#!/bin/bash
# "local" and "read -d" are Bash features, so run this with bash, not plain sh.

# Read one tag into TAG and the text that follows it into CONTENT.
identify_tags () {
   local IFS='>'
   read -d '<' TAG CONTENT
}

cat "$1" | while identify_tags ; do
   case $TAG in
      'item')
         # a new entry starts: reset every field
         title=''
         link=''
         pubDate=''
         datetime=''
         description=''
         ;;
      'title')
         title="$CONTENT"
         ;;
      'link')
         link="$CONTENT"
         ;;
      'pubDate')
         # convert pubDate format for <time datetime=""> (these options need GNU date)
         datetime=$( date --date "$CONTENT" --iso-8601=minutes )
         pubDate=$( date --date "$CONTENT" '+%D %H:%M%P' )
         ;;
      'description')
         # convert '&lt;' and '&gt;' to '<' and '>'
         description=$( echo "$CONTENT" | sed -e 's/&lt;/</g' -e 's/&gt;/>/g' )
         ;;
      '/item')
         # the entry is complete: print it as an HTML fragment
         cat<<EOF
<article>
<h3><a href="$link">$title</a></h3>
<p>$description
<span class="post-date">posted on <time
datetime="$datetime">$pubDate</time></span></p>
</article>
EOF
         ;;
      esac
done

Customizations

This will create a nicely formatted bit of HTML. However, you could also produce a CSV file by replacing the HTML template between cat<<EOF and EOF with a line of comma- or semicolon-separated fields:

"$title", "$link", "$description"
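As a rough sketch of that CSV variant: the rss_to_csv and csv_field names are hypothetical, the function reads the feed from standard input rather than from $1, and doubling embedded quotes is one common CSV convention that I am assuming here.

```shell
#!/bin/bash
identify_tags() {
   local IFS='>'
   read -d '<' TAG CONTENT
}

# csv_field (hypothetical helper): wrap a value in double quotes and
# double any quote characters found inside it.
csv_field() {
   local q='"'
   printf '"%s"' "${1//$q/$q$q}"
}

# rss_to_csv (hypothetical name): print one CSV line per item read from stdin.
rss_to_csv() {
   local title link description
   while identify_tags; do
      case $TAG in
         'item')        title=''; link=''; description='' ;;
         'title')       title="$CONTENT" ;;
         'link')        link="$CONTENT" ;;
         'description') description="$CONTENT" ;;
         '/item')
            printf '%s,%s,%s\n' \
               "$(csv_field "$title")" \
               "$(csv_field "$link")" \
               "$(csv_field "$description")"
            ;;
      esac
   done
}

# Made-up single-item feed, kept on one line for the example.
sample='<channel><item><title>Say "hi"</title><link>http://www.example.com/</link><description>Full news item text.</description></item></channel>'
csv=$(printf '%s' "$sample" | rss_to_csv)
echo "$csv"
```

With a real file you would run something like rss_to_csv < feed.xml > feed.csv, where both file names are placeholders.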

Or forget about title and link altogether: for something like a podcast, we might only be interested in the download links (BTW, the tag here is embed), which we can pass to curl or wget to download everything.

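#!/bin/bash
identify_tags () {
   local IFS='>'
   read -d '<' TAG CONTENT
}

cat "$1" | while identify_tags ; do
   case $TAG in
      'item')
         embed=''
         ;;
      'embed')
         # keep the download link for this entry
         embed="$CONTENT"
         ;;
      '/item')
         # download the file, resuming a partial download if one exists
         [ -n "$embed" ] && wget -c "$embed"
         ;;
      esac
done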

So there you go, a pretty nifty little script to get your latest podcast.

Make sure to let me know in the comments if you liked this or if you notice an issue; I will be happy to update it. Hope you learned a thing or two about how powerful your good ol' Bash can be.


Very detailed, thanks!