Archive for July, 2007

Scraping pricegrabber.com with scRUBYt!

This is a new series on scraping different (hopefully typical) web pages to show off various scRUBYt! techniques. We will start off with a pricegrabber.com extractor. Check out the big picture to see what we will accomplish today!

You will see an example of:

  • filling and submitting a textfield
  • crawling to the detail pages of the results
  • downloading images
  • scraping desired data (text, price, user rating)

Goal:

  • Go to http://cameras.pricegrabber.com.
  • Enter ‘canon EOS 20D’ into the search textfield and submit the form.
  • Download the item image from the detail page(s) and additional product information (like the lowest price) from the ‘Product Details’ tab.

Solution:

  • Use the navigational commands ‘fetch’, ‘fill_textfield’ and ‘submit’ to navigate to the list page containing items related to the search term ‘canon EOS 20D’.
  • Crawl to the detail page by “clicking on” the link Canon EOS 20D Digital SLR Camera Body Only.
  • Choose the detail tab ‘Product Details’ by clicking on it.
  • Download the product image by providing its location, extracting its ‘src’ attribute and finally setting a directory to download the images into using a ‘download’ pattern (note: you do not need to create the given directory, scRUBYt! does it for you! Look for it at the same location where your extractor is stored).
  • Scrape the desired data (in this case Description and Lowest Price).
  • Extract user rating with a regular expression.
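To make the ‘download’ step a bit less magical, here is a minimal plain-Ruby sketch of the bookkeeping involved: creating the target directory if it is missing and deriving the local file name from the image URL. The helper name `local_path_for` is made up for illustration, it is not part of scRUBYt!, and the actual HTTP download is left out on purpose.

```ruby
require 'fileutils'
require 'uri'

# Hypothetical helper (not part of scRUBYt!): make sure the target
# directory exists and work out where the image would be saved.
def local_path_for(image_url, directory)
  FileUtils.mkdir_p(directory)                          # no error if it already exists
  file_name = File.basename(URI.parse(image_url).path)  # last segment of the URL path
  File.join(directory, file_name)
end

puts local_path_for('http://ai.pricegrabber.com/pi/0/37/48/3748716_125.jpg', 'camera_images')
```

Running this creates a `camera_images` directory next to the script, which mirrors the note above about the directory being created for you.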

The code:

  require 'rubygems'
  require 'scrubyt'

  pricegrabber_data = Scrubyt::Extractor.define do

    # Navigate: open the site, fill in the search field, submit the form
    fetch 'http://cameras.pricegrabber.com'
    fill_textfield 'form_keyword', 'canon EOS 20D'
    submit

    # Follow the result link to the detail page, then open the 'Product Details' tab
    camera 'Canon EOS 20D Digital SLR Camera Body Only', :generalize => false do
      camera_detail do
        detail_tab 'Product Details', :generalize => false do
          detail_detail do
            camera_record :generalize => false do
              # Locate the image by example, take its src attribute, download it
              image "http://ai.pricegrabber.com/pi/0/37/48/3748716_125.jpg" do
                image_url "src", :type => :attribute do
                  download 'camera_images', :type => :download
                end
              end
              name 'EOS 20D Digital SLR Camera Body Only'
              lowestprice '$649.95'
              # Find the element containing the review count, then pull the rating out of it
              rating_and_reviews :contains => '(Read 31 Reviews)' do
                rating /(\d\.\d\d \/ \d\.\d\d)/
              end
            end
          end
        end
      end
    end
  end

  puts pricegrabber_data.to_xml

If you’d like to run it, cut’n’paste it from here (otherwise the code will be messed up).

If you are more of a visual type, check out this chart depicting the process.
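If the regular expression used for the rating looks cryptic, it can be tried out on its own in plain Ruby, without scRUBYt!. The sample string below is made up to resemble the text around the rating, so treat it as an assumption; the real page content may differ.

```ruby
# Hypothetical sample of the text around the user rating;
# the real page content may differ.
sample = 'User Rating: 4.47 / 5.00 (Read 31 Reviews)'

# One digit, a dot, two digits, then " / " and the same shape again.
RATING_PATTERN = /(\d\.\d\d \/ \d\.\d\d)/

# Returns the captured rating, or nil when the text contains none.
def extract_rating(text)
  match = RATING_PATTERN.match(text)
  match && match[1]
end

puts extract_rating(sample)
puts extract_rating('No rating yet').inspect
```

The first call prints the captured group only, so the surrounding label and the review count are dropped; the second shows that text without a rating simply yields nil.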

Troubleshooting

Problem: “This extractor does not work!”

Solution: The ‘Lowest Price’ changes quite frequently - check whether the current ‘Lowest Price’ at http://cameras.pricegrabber.com/digital/canon/m/3748716/details/ matches the one stated in the extractor. If not, simply update the example with the current value. After that, it should work.

Problem: “The extractor returns data for the first listed item only. Where is the rest of data?”

Solution: The extractor was designed only to give a taste of how to solve scraping problems like the one stated in ‘Goal’, therefore it is limited to the first item only. If you would like to scrape all the items on the list page, simply remove :generalize => false from “camera ‘Canon EOS 20D Digital SLR Camera Body Only’, :generalize => false do”. However, it will still return data from the first list page only, since no crawling to the next pages is defined in the scraper (in our case that would be kind of insane, since no next pages exist for this type of camera).
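One way to picture what dropping :generalize => false changes: an example-based pattern pins down one specific element, while a generalized one matches every element with the same shape. The sketch below shows the idea on an XPath by stripping positional indices. It is an assumption made for illustration only, not scRUBYt!'s actual code, and both the helper name and the XPath are made up.

```ruby
# Illustration only, NOT scRUBYt!'s implementation: generalizing an XPath
# here means removing positional indices such as [3], so the expression
# matches all sibling elements instead of one.
def generalize_xpath(xpath)
  xpath.gsub(/\[\d+\]/, '')
end

example_xpath = '/html/body/table[2]/tr[3]/td[1]/a'  # hypothetical path to one result link
puts example_xpath
puts generalize_xpath(example_xpath)
```

The first line printed points at a single link; the second, with the indices gone, would match the corresponding link in every row.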

New things on the block

Hey all,

I hope things around scRUBYt! are going to boil again :-). I am happy to announce that the community has started a new series of step-by-step tutorials! The first piece has just hit the road, be sure to check it out. By my estimate it is at an intermediate level, so beginners, don’t panic if you feel lost. This is just a pilot run; the next tutorials will cover all the basics.

I am eager to hear your feedback about the usability, design, tips for improvement, whatever. Contributions of any kind are warmly welcome!

I would like to announce that the scRUBYt! bug tracker is up and running - please send your bug reports, suggestions, feature requests etc. there (it’s a publicly accessible Lighthouse account). If you are having any issues, please let us know!

CopperMonkey