
Tracking the News with Google News


In this article, you'll create a graphical report from Google News RSS data, using a handy utility called FeedTools and a plug-in called CSS Graphs Helper. This article is excerpted from chapter 11 of the book Practical Reporting with Ruby and Rails, written by David Berube (Apress; ISBN: 1590599330).

Author Info:
By: Apress Publishing
April 15, 2010
TABLE OF CONTENTS:
  1. Tracking the News with Google News
  2. Company News Coverage Reporting
  3. Dissecting the Code
  4. Creating the News Tracker Report Application
  5. Dissecting the Code


Tracking the News with Google News - Dissecting the Code
(Page 3 of 5)

First, the script in Listing 11-1 needs to create a connection to the database:

# If there's a config/database.yml file,
# read from that . . .

if File.exist?('./config/database.yml')
  require 'yaml'
  ActiveRecord::Base.establish_connection(
    YAML.load(File.read('./config/database.yml'))['development']
  )
else
  # . . . otherwise, connect to the default settings.
  # Note that if you don't have the default MySQL settings below,
  # you should change them.

  ActiveRecord::Base.establish_connection(
    :adapter  => "mysql",
    :host     => "your_mysql_hostname_here",
    :username => "your_mysql_username_here",
    :password => "your_mysql_password_here",
    :database => "company_pr")
end

If you run the script from the root of a Rails application, the information from the config/database.yml file and the parameters for the development environment are loaded. If not, it manually creates the connection with the default parameters. Note that you can change ['development'] to ['production'] on the first establish_connection line if you would prefer to use the connection parameters from the production environment.
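
The environment lookup and the fallback can be tried out without a database. The following is a minimal sketch, not the script's own code: the connection_settings helper, the DEFAULT_SETTINGS constant, and the sample database names are all made up for illustration.

```ruby
require 'yaml'
require 'tmpdir'

# Hypothetical fallback settings, mirroring the placeholders in the script.
DEFAULT_SETTINGS = {
  'adapter'  => 'mysql',
  'host'     => 'your_mysql_hostname_here',
  'database' => 'company_pr'
}.freeze

# Hypothetical helper: return the settings for one environment from a
# Rails-style database.yml, or the defaults when the file is missing.
def connection_settings(path, environment = 'development')
  return DEFAULT_SETTINGS unless File.exist?(path)

  YAML.load(File.read(path)).fetch(environment)
end

# Demonstrate with a throwaway database.yml holding two environments.
Dir.mktmpdir do |dir|
  path = File.join(dir, 'database.yml')
  File.write(path, <<~YML)
    development:
      adapter: mysql
      database: company_pr_dev
    production:
      adapter: mysql
      database: company_pr
  YML

  puts connection_settings(path)['database']                # company_pr_dev
  puts connection_settings(path, 'production')['database']  # company_pr
  puts connection_settings(File.join(dir, 'missing.yml'))['database'] # company_pr
end
```

Using fetch rather than [] means a misspelled environment name raises an error immediately instead of passing nil on to the connection call.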

Next, let's examine the code that contains the single model and the schema:

class Stories < ActiveRecord::Base
end

unless Stories.table_exists?
  ActiveRecord::Schema.define do

    create_table :stories do |t|
      t.column :guid, :string
      t.column :title, :string
      t.column :source, :string
      t.column :url, :string
      t.column :published_at, :datetime
      t.column :created_at, :datetime
    end

    create_table :cached_feeds do |t|
      t.column :url, :string
      t.column :title, :string
      t.column :href, :string
      t.column :link, :string
      t.column :feed_data, :text
      t.column :feed_data_type, :string, :limit => 25
      t.column :http_headers, :text
      t.column :last_retrieved, :datetime
    end

    # Without the following line,
    # you can't retrieve large results -
    # like those we use in this script.

    execute "ALTER TABLE cached_feeds
            CHANGE COLUMN feed_data feed_data MEDIUMTEXT;"
  end
end

This code creates a single model, Stories, and then creates a table for it. It also creates a second table named cached_feeds, which FeedTools uses to store cached feeds. Note that the FeedTools site gives the original schema in SQL; the version here is a similar schema translated into a Rails schema definition. However, because the feed_data column holds more data than fits in a regular TEXT column, you use an ALTER TABLE ... CHANGE COLUMN statement to change the feed_data column to a MEDIUMTEXT type. (If Rails supported MEDIUMTEXT columns out of the box, you could have created it as a MEDIUMTEXT column in the first place.)
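
To see why the column type matters, it helps to compare the capacities of MySQL's text types. The sketch below lists those limits and picks the smallest type that can hold a given string. The smallest_text_type helper and the sample sizes are assumptions for illustration; the actual size of a Google News feed will vary.

```ruby
# Maximum sizes, in bytes, of MySQL's text column types.
MYSQL_TEXT_LIMITS = {
  'TINYTEXT'   => 2**8  - 1,   # 255
  'TEXT'       => 2**16 - 1,   # 65,535
  'MEDIUMTEXT' => 2**24 - 1,   # 16,777,215
  'LONGTEXT'   => 2**32 - 1    # 4,294,967,295
}.freeze

# Hypothetical helper: name the smallest column type that can hold the data.
def smallest_text_type(data)
  MYSQL_TEXT_LIMITS.find { |_type, limit| data.bytesize <= limit }&.first
end

small_feed = 'x' * 10_000    # made-up size that fits in TEXT
large_feed = 'x' * 100_000   # made-up size that is too big for TEXT

puts smallest_text_type(small_feed)   # TEXT
puts smallest_text_type(large_feed)   # MEDIUMTEXT
```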


Tip  You could create this database using Rails migrations as well, but in this case I've included it in this script. This is a simple way to create a database using Active Record, and it's independent of any Rails application, which means that you could use this loader and then make reports on the data from any reporting application. For example, if you did not have a Rails application and the developers in different departments of your company wrote the code to display the data as a Perl script, a Python program, an ASP.NET web application, a Crystal Reports report, or even a Microsoft Excel macro, they could still use this loader script.


Now that you have a database connection, a structure, and a model, you need to construct a Google News URL and download the data:

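require 'uri'

output_format = 'rss'
per_page = 100

query = ARGV[0]
# URI.encode was removed in Ruby 3.0 and never escaped &;
# encode_www_form_component escapes it (and encodes spaces as +).
query_encoded = URI.encode_www_form_component(query)

# per_page must be converted with to_s: String#<< with an Integer
# appends the character with that code point, not the digits.
feed_url = "http://news.google.com/news" <<
           "?hl=en&ned=us&ie=UTF-8" <<
           "&num=" << per_page.to_s <<
           "&output=" << output_format <<
           "&q=" << query_encoded

FeedTools.configurations[:feed_cache] = "FeedTools::DatabaseFeedCache"

feed = FeedTools::Feed.open(feed_url)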

The URL was initially constructed by making a sample search on Google News, noting the RSS URL it generated, and writing code that generates the same URL. You can follow a similar technique to create URL-generation code for other services, such as Google Blog Search.

Two fixed values, output_format and per_page, are used to create the URL. You can vary these as desired. Of course, you could have hard-coded them into the URL, but separating them makes them a bit easier to change. Note that you can simply change the output_format variable to 'atom' to get the output in Atom instead of RSS form. Since FeedTools parses Atom as seamlessly as RSS, the code will work with Atom without any other changes.
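
One way to make the two values easier to vary is to wrap the URL construction in a method. This is only a sketch: the google_news_feed_url name and its keyword arguments are hypothetical, and it uses URI.encode_www_form from Ruby's standard library to assemble the query string rather than the string appends shown in the listing.

```ruby
require 'uri'

# Hypothetical helper that takes the format and page size as arguments.
def google_news_feed_url(query, output_format: 'rss', per_page: 100)
  params = {
    'hl'     => 'en',
    'ned'    => 'us',
    'ie'     => 'UTF-8',
    'num'    => per_page.to_s,
    'output' => output_format,
    'q'      => query
  }
  'http://news.google.com/news?' + URI.encode_www_form(params)
end

puts google_news_feed_url('Apress')
# http://news.google.com/news?hl=en&ned=us&ie=UTF-8&num=100&output=rss&q=Apress

puts google_news_feed_url('Apress', output_format: 'atom', per_page: 25)
# http://news.google.com/news?hl=en&ned=us&ie=UTF-8&num=25&output=atom&q=Apress
```

The explicit to_s on per_page matters when you build the URL with << instead: appending an Integer to a String with << adds the character with that code point, so 100 would become the letter d rather than the digits.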

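The third variable, query_encoded, is set by the application to be a URL-encoded version of the search string passed on the command line. The encoding function comes from Ruby's built-in URI library and translates characters that are not allowed in URLs, such as spaces, into their encoded form. Be aware that URI.encode leaves reserved characters such as & untouched and was removed in Ruby 3.0; URI.encode_www_form_component escapes & as well, which makes it the safer choice for a query parameter.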
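
A short example shows what the encoding does to a search string that contains a character with special meaning in a query string. The search string here is made up; the two URI methods are part of Ruby's standard library.

```ruby
require 'uri'

query = 'AT&T earnings'

# Left as-is, the & would cut the q parameter short after "AT".
encoded = URI.encode_www_form_component(query)
puts encoded                                   # AT%26T+earnings

# Decoding restores the original search string.
puts URI.decode_www_form_component(encoded)    # AT&T earnings
```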


Note  The difference between a Uniform Resource Identifier (URI) and a Uniform Resource Locator (URL) is generally unimportant. Strictly speaking, a URI can also be a Uniform Resource Name (URN), which can specify the identity of a thing, such as a book identified by its ISBN, without actually specifying how to get it.


Next, you set the FeedTools.configurations[:feed_cache] variable to be equal to "FeedTools::DatabaseFeedCache", which causes FeedTools to use its built-in DatabaseFeedCache class. If you're inclined to write a custom FeedTools cache class--one that stores information in, say, a memcached server--you can pass in a different class name. Note that it's passed in as a string, not as a class constant or a symbol.
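
Passing a class name as a string works because Ruby can look up a constant by name at run time. The sketch below shows that general technique with Object.const_get; it is not taken from FeedTools, whose own lookup may work differently. MyCaches::MemoryFeedCache and resolve_class are hypothetical names, and a real FeedTools cache class would have to implement whatever interface FeedTools expects.

```ruby
# Hypothetical stand-in for a custom cache class.
module MyCaches
  class MemoryFeedCache
    def initialize
      @store = {}
    end

    def write(url, data)
      @store[url] = data
    end

    def read(url)
      @store[url]
    end
  end
end

# Turn a class name given as a string into the class itself.
def resolve_class(name)
  Object.const_get(name)
end

cache_class = resolve_class('MyCaches::MemoryFeedCache')
cache = cache_class.new
cache.write('http://example.com/feed', '<rss/>')

puts cache_class                              # MyCaches::MemoryFeedCache
puts cache.read('http://example.com/feed')    # <rss/>
```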

Then you open the feed using the FeedTools::Feed.open method. This method is format-agnostic; the feed can be RSS, Atom, or CDF. Also, you don't need to use a separate method to download the URL and then pass it to FeedTools, because FeedTools downloads the feed and parses it in one step.

Finally, you add the stories to your MySQL database:

if !feed.live?
  puts "feed is cached..."
  puts "last retrieved: #{ feed.last_retrieved }"
  puts "expires: #{ feed.last_retrieved + feed.time_to_live }"
else
  feed.items.each do |feed_story|
    if not (Stories.find_by_title(feed_story.title) or
            Stories.find_by_url(feed_story.link) or
            Stories.find_by_guid(feed_story.guid))
      puts "processing story '#{feed_story.title}' new"
      Stories.new do |new_story|
        new_story.title = feed_story.title.gsub(/<[^>]*>/, '') # strip HTML
        new_story.guid = feed_story.guid
        # matches the source column defined in the schema
        new_story.source = feed_story.publisher.name if feed_story.publisher.name
        new_story.url = feed_story.link
        new_story.published_at = feed_story.published
        new_story.save
      end
    else
      # do nothing
    end
  end
end

If the feed isn't live--in other words, if it's cached--you print a brief message saying so, followed by the time it was last retrieved and the time the cache will expire. (You could go through the data-insertion loop either way, but cached feed items are guaranteed to be in the database already, so that would just be a waste of time.) Note that the test could also be written as unless feed.live? or if not feed.live?; which form reads best is a matter of taste.
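
The expiry time printed for a cached feed is simple time arithmetic: the retrieval time plus a lifetime in seconds. Here is a minimal sketch of that calculation. The CachedFeed struct, the expired? method, and the one-hour lifetime are assumptions for illustration and are not part of FeedTools.

```ruby
# Hypothetical record of a cached feed: when it was fetched and how long
# (in seconds) the cached copy stays valid.
CachedFeed = Struct.new(:last_retrieved, :time_to_live) do
  def expires_at
    last_retrieved + time_to_live   # Time + seconds gives a new Time
  end

  def expired?(now = Time.now)
    now >= expires_at
  end
end

fetched = Time.utc(2010, 4, 15, 12, 0, 0)
cached  = CachedFeed.new(fetched, 60 * 60)   # assumed one-hour lifetime

puts cached.expires_at                                   # 2010-04-15 13:00:00 UTC
puts cached.expired?(Time.utc(2010, 4, 15, 12, 30, 0))   # false
puts cached.expired?(Time.utc(2010, 4, 15, 13, 30, 0))   # true
```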

If the feed is live, you iterate through all of the items in the feed by using the items method. You check if any stories exist with the same title, url, or guid; if none exist, you add the story to the database. Otherwise, the story is a duplicate and you don't add the item. In most cases, it's sufficient to check by guid alone. However, for news items, checking by all three is a good idea, since you may eventually want to have more aggregators, which may assign the guid or url for items differently.
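
The three-way duplicate check can be illustrated in memory, without Active Record. In this sketch the Story struct stands in for the stories table, and the helper names and sample stories are made up; the tag-stripping regular expression is the same one the loader applies to titles.

```ruby
# Hypothetical in-memory stand-in for the stories table.
Story = Struct.new(:guid, :title, :url)

# Remove anything that looks like an HTML tag.
def strip_html(text)
  text.gsub(/<[^>]*>/, '')
end

# A story is a duplicate if any one of the three fields matches.
def duplicate?(stories, candidate)
  stories.any? do |s|
    s.title == candidate.title || s.url == candidate.url || s.guid == candidate.guid
  end
end

# Add the story unless it is a duplicate; report whether it was added.
def add_story(stories, candidate)
  return false if duplicate?(stories, candidate)

  stories << candidate
  true
end

stories = []
first  = Story.new('g1', strip_html('<b>Apress</b> ships book'), 'http://example.com/a')
# Same article seen through another aggregator: new guid, same URL.
second = Story.new('g2', 'Apress ships new book', 'http://example.com/a')
third  = Story.new('g3', 'Unrelated story', 'http://example.com/b')

puts add_story(stories, first)      # true
puts add_story(stories, second)     # false
puts add_story(stories, third)      # true
puts stories.map(&:title).inspect   # ["Apress ships book", "Unrelated story"]
```

The second story shows why checking the guid alone can fall short: its guid is new, but the matching URL still marks it as a duplicate.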

