Archive for August, 2007

Scrapin’ Google in no sec

Google is back on stage! I have dusted off one of the best-known examples in scRUBYt!’s history and wrapped it in a brand new coat. Well, here it is, hot off the frying pan. If you are a scRUBYt! newbie, this is a perfect place to start to ‘scrub’ around :) .

Example of:

  • filling and submitting a textfield
  • extracting and using a href attribute
  • recursively crawling to the next page(s)

Goal:

Go to google.com. Enter ‘ruby’ into the search textfield and submit the form. Extract the URLs of the results on the first 2 pages.

Solution:

Use the navigational commands ‘fetch’, ‘fill_textfield’ and ‘submit’ to navigate to the page of interest. There, extract the links with the pattern ‘link’. The URLs should be extracted with ‘link’’s child pattern, ‘url’, which is an attribute pattern. This will extract the first 10 results, but you need the first 20. To achieve this, the ‘next_page’ idiom should be used, with ‘:limit’ set to 2.
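To make the ‘next_page’ idea a bit more tangible, here is a minimal plain-Ruby sketch of "collect the hrefs, then follow the Next link, but visit at most N pages". It is not how scRUBYt! does it internally. The in-memory `pages` hash, the `class="r"` and `id="next"` markup, and the helper names `result_urls`, `next_page_id` and `crawl` are all made up for illustration.

```ruby
# Stand-in for the web: page id => HTML. Entirely hypothetical markup.
pages = {
  'page1' => '<a class="r" href="http://example.com/a">A</a> <a id="next" href="page2">Next</a>',
  'page2' => '<a class="r" href="http://example.com/b">B</a> <a id="next" href="page3">Next</a>',
  'page3' => '<a class="r" href="http://example.com/c">C</a>'
}

# Pull the href attribute out of every result link on one page.
def result_urls(html)
  html.scan(/<a class="r" href="([^"]+)"/).flatten
end

# Find the target of the "Next" link, or nil on the last page.
def next_page_id(html)
  html[/<a id="next" href="([^"]+)"/, 1]
end

# Visit at most `limit` pages, following the "Next" link each time.
def crawl(pages, start_id, limit:)
  urls = []
  page_id = start_id
  limit.times do
    break if page_id.nil?          # no "Next" link: we ran out of pages
    html = pages.fetch(page_id)
    urls.concat(result_urls(html))
    page_id = next_page_id(html)
  end
  urls
end

puts crawl(pages, 'page1', limit: 2)
```

With the limit set to 2, only the links from the first two fake pages are collected, which mirrors the "first 20 results instead of the first 10" reasoning above.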

Check out the code:

  require 'rubygems'
  require 'scrubyt'

  google_data = Scrubyt::Extractor.define do
    # Perform the action(s)
    fetch 'http://www.google.com/ncr'
    fill_textfield 'q', 'ruby'
    submit
    # Construct the wrapper
    link "Ruby Programming Language" do
      url "href", :type => :attribute
    end
    next_page "Next", :limit => 2
  end

  puts google_data.to_xml

Check out the result here!

Finally, if you are a visual type, check out this overview diagram:

Something not clear? Check out our tiny FAQ!

‘What’s ‘q’ supposed to mean?’

‘q’ is the name of the Google search box. How did I figure it out? Well, you can search the source code back and forth to find it, but I guess using a tool like XPather is far easier :) .
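If you do want to go the "search the source code" route, a few lines of Ruby can do the searching for you. This is only a rough sketch: the `form_html` snippet is a hypothetical, heavily simplified form (the real page's markup differs), `input_names` is a made-up helper, and a regex is a shortcut that will not survive messy real-world HTML.

```ruby
# Hypothetical, simplified search form used only for this demonstration.
form_html = <<~HTML
  <form action="/search">
    <input type="hidden" name="hl" value="en">
    <input type="text" name="q" maxlength="2048">
    <input type="submit" name="btnG" value="Google Search">
  </form>
HTML

# Return the name attribute of every <input> tag whose type equals `type`.
def input_names(html, type)
  tags = html.scan(/<input\b[^>]*>/i)
  matching = tags.select { |tag| tag[/\btype="([^"]*)"/i, 1] == type }
  matching.map { |tag| tag[/\bname="([^"]*)"/i, 1] }.compact
end

puts input_names(form_html, 'text').inspect
```

Whatever name turns up for the text input is what you would pass as the first argument of ‘fill_textfield’.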

XPather?!?…never heard of it

If you are not familiar with XPather you should check out either this quick kick-off tutorial (recommended for smaller appetites :) or the full user guide (just to tease your taste buds). As an extra topping here is a yamoo cheat sheet.

To install XPather go here (I assume you are using Firefox: 1.5.0.* – 2.0.0.*. If not, sorry folks, you have to come up with another alternative :) ).

Still confused? This visual step-by-step ‘how-to’ may help…

This tool is actually very handy to use (that is one of the reasons it is recommended by the team). Give it a try and you’ll get the hang of it. Here is a little example as a demonstration (let’s assume you have DOM Inspector in your Firefox):

All you need to do is

  1. go to www.google.com
  2. launch the DOM Inspector (you can find it under the Tools menu in any Mozilla window)
  3. click on the first icon in the top row (see snapshot)

     image

  4. now click on the Google search box (if you did it right, a red frame should appear around the box and blink several times)
  5. go back to the DOM Inspector and enjoy the result :)

For those who like it to be visualized…

    image