Archive for July, 2007
Scraping pricegrabber.com with scRUBYt!
You will see an example of:
- filling and submitting a textfield
- crawling to the detail pages of the results
- downloading images
- scraping desired data (text, price, user rating)
Goal:
- Go to http://cameras.pricegrabber.com.
- Enter ‘canon EOS 20D’ into the search textfield and submit the form.
- Download the item image from the detail page(s) and additional product information (like the lowest price) from the ‘Product Details’ tab.
Solution:
- Use the navigational commands ‘fetch’, ‘fill_textfield’ and ‘submit’ to navigate to the list page containing items related to the search term ‘canon EOS 20D’.
- Crawl to the detail page by “clicking on” the link Canon EOS 20D Digital SLR Camera Body Only.
- Choose the detail tab ‘Product Details’ by clicking on it.
- Download the product image by providing its location, extracting its ‘src’ attribute, and finally setting a directory to download the images into using a ‘download’ pattern (note: you do not need to create the given directory, scRUBYt! does it for you! Look for it in the same location where your extractor is stored).
- Scrape the desired data (in this case Description and Lowest Price).
- Extract user rating with a regular expression.
The code:
    require 'rubygems'
    require 'scrubyt'

    pricegrabber_data = Scrubyt::Extractor.define do
      # Navigate: open the site, fill in the search field, submit the form
      fetch 'http://cameras.pricegrabber.com'
      fill_textfield 'form_keyword', 'canon EOS 20D'
      submit

      # Follow the result link to the detail page (first match only)
      camera 'Canon EOS 20D Digital SLR Camera Body Only', :generalize => false do
        camera_detail do
          # Switch to the 'Product Details' tab
          detail_tab 'Product Details', :generalize => false do
            detail_detail do
              camera_record :generalize => false do
                # Locate the image, take its src attribute, download the file
                image "http://ai.pricegrabber.com/pi/0/37/48/3748716_125.jpg" do
                  image_url "src", :type => :attribute do
                    download 'camera_images', :type => :download
                  end
                end
                name 'EOS 20D Digital SLR Camera Body Only'
                lowestprice '$649.95'
                # Pull the rating out of the reviews text with a regular expression
                rating_and_reviews :contains => '(Read 31 Reviews)' do
                  rating /(\d.\d\d \/ \d.\d\d)/
                end
              end
            end
          end
        end
      end
    end

    puts pricegrabber_data.to_xml
If you’d like to run it, cut’n’paste it from here (otherwise the code will be messed up).
If you are more of a visual type, check out this chart depicting the process.
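The rating pattern from the last step can also be tried on its own, outside scRUBYt!. Here is a minimal sketch in plain Ruby, assuming the rating text on the page looks something like ‘4.52 / 5.00 (Read 31 Reviews)’. The sample string and the extract_rating helper are made up for illustration; they are not taken from the live page or from scRUBYt! itself.

```ruby
# Same shape as the pattern in the extractor: a digit, any character,
# two digits, then " / " and the same shape again.
RATING_PATTERN = /(\d.\d\d \/ \d.\d\d)/

# Hypothetical helper: returns the first rating-like substring, or nil
# when the text contains nothing that matches.
def extract_rating(text)
  match = RATING_PATTERN.match(text)
  match && match[1]
end

sample = '4.52 / 5.00 (Read 31 Reviews)' # assumed sample text
puts extract_rating(sample).inspect
puts extract_rating('no rating here').inspect
```

If the real rating text has a different shape (say, a single decimal digit), the pattern will not match and would need adjusting in the extractor as well.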
Troubleshooting
Problem: “This extractor does not work!”
Solution: ‘Lowest Price’ changes quite frequently. Check whether the actual ‘Lowest Price’ at http://cameras.pricegrabber.com/digital/canon/m/3748716/details/ matches the one stated in the extractor. If not, simply update the example with the current value. Hmmm, this should work.
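One way to tell whether an example has gone stale is to look for it in the page source before re-running the extractor. A minimal sketch in plain Ruby follows; it is not part of scRUBYt!, stale_examples is a hypothetical helper, and the HTML fragment is made up to stand in for the real detail page.

```ruby
# Hypothetical helper: given the page source and the example strings used in
# the extractor, return the examples that no longer appear on the page.
def stale_examples(html, examples)
  examples.reject { |example| html.include?(example) }
end

# Made-up page fragment standing in for the real detail page.
html = '<td>Lowest Price</td><td>$639.00</td><a href="#">(Read 31 Reviews)</a>'

# The example strings from the extractor that depend on live page content.
examples = ['$649.95', '(Read 31 Reviews)']

stale_examples(html, examples).each do |example|
  puts "Example no longer on the page: #{example}"
end
```

In practice the html string would come from fetching the detail page yourself; any example reported here is the one to replace in the extractor.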
Problem: “The extractor returns data for the first listed item only. Where is the rest of data?”
Solution: The extractor was designed only to give a taste of how to solve scraping problems like the one stated in ‘Goal’, so it is limited to the first item. If you would like to scrape all the items on the list page, simply remove :generalize => false from “camera ‘Canon EOS 20D Digital SLR Camera Body Only’, :generalize => false do”. However, it will still return data from the first list page only, because no crawling to the next pages is defined in the scraper (in our case that would be kind of insane, since no next pages exist for this type of camera).
New things on the block
Hey all,
I hope things around scRUBYt! are going to boil again :-). I am happy to announce that the community has started a new series of step-by-step tutorials! The first piece has just hit the road, be sure to check it out. By my estimate it is at the intermediate level, so beginners, don’t panic if you feel lost. This is just a pilot run; the next tutorials will cover all the basics.
I am eager to hear your feedback about the usability, design, tips for improvement, whatever. Contributions of any kind are warmly welcome!
I would like to announce that the scRUBYt! bug tracker is up and running - please send your bug reports, suggestions, feature requests etc. there (it’s a publicly accessible Lighthouse account). If you are having any issues, please let us know!
CopperMonkey