Archive for April, 2007

New release, 0.2.8 is out (for a week already ;-)

Well, I have been so busy recently that I literally did not have time to announce it here, so in case anyone has not noticed yet: a new release is out!

I guess the changelog is worth more than a thousand words, so without further ado, here we go:

[NEW] download pattern: download the file pointed to by the
       parent pattern
[NEW] checking checkboxes
[NEW] basic authentication support
[NEW] default values for missing elements (basic version)
[NEW] possibility to resolve relative paths against a custom url
[NEW] first simple version of to_csv and to_hash
[NEW] complete rewrite of the exporting system (Credit: Neelance)
[NEW] first version of smart regular expressions: they are constructed
       from examples, just as regular expressions (Credit: Neelance)
[NEW] Possibility to click the n-th link
[FIX] Clicking on links using scRUBYt's advanced example lookup (i.e.
       you can use :begins_with etc.)
[NEW] Forcing writing text of non-leaf nodes with :write_text => true
[NEW] Possibility to set custom user-agent; Specified default user agent
       as Konqueror
[FIX] Fixed crawling to detail pages in case of leaving the
       original site (Credit: Michael Mazour)
[FIX] fixing the '//' problem - if the relative url contained two
       slashes, the fetching failed
[FIX] scrubyt assumed that documents have a list of nested elements
       (Credit: Rick Bradley)
[FIX] crawling to detail pages works also if the parent pattern is
       a string pattern
[FIX] shortcut url fixed again
[FIX] regexp pattern fixed in case its parent was a string
[FIX] refactoring the core classes, lots of bugfixes and stabilization
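
Three of the entries above (basic authentication, a custom user agent, and resolving relative paths against a custom url) describe things that Ruby's standard library can also do on its own. The snippet below is only a minimal sketch of those ideas using `uri` and `net/http`; it is not scRUBYt!'s API or its implementation. The URLs, the credentials, the user agent string and the `build_request` helper are all made up for illustration, and no request is actually sent.

```ruby
require 'uri'
require 'net/http'

# Resolve relative links against a base URL (hypothetical example URLs).
base   = 'http://www.example.com/catalog/index.html'
detail = URI.join(base, 'items/42.html')
parent = URI.join(base, '../about.html')

# Hypothetical helper: build (but do not send) a GET request that carries
# a custom User-Agent header and HTTP basic authentication credentials.
def build_request(uri, user_agent:, user:, password:)
  request = Net::HTTP::Get.new(uri)
  request['User-Agent'] = user_agent
  request.basic_auth(user, password)
  request
end

request = build_request(detail,
                        user_agent: 'ExampleAgent/1.0 (made-up string)',
                        user: 'alice',
                        password: 'secret')

puts detail                    # the relative path resolved against base
puts parent                    # '..' steps out of /catalog/
puts request['User-Agent']
puts request['Authorization']  # "Basic " followed by the encoded credentials
```

Building the request separately from sending it keeps the sketch runnable offline; in a real scraper you would hand the request to a `Net::HTTP` connection (or let a higher-level tool do that for you).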

In the near future, you can expect some tutorials on the new stuff (mainly crawling to the detail pages). If you have any questions, suggestions or ideas, please visit the scRUBYt! forum and tell us!

Note for Windows users: As of 0.2.8, scRUBYt! depends on ParseTree and Ruby2Ruby - unfortunately, it seems ParseTree is not that trivial to set up on Windows. However, we are currently working on a new project to solve this problem, and we are making quite good progress, so I believe that in the next release, 0.3.0, this obstacle will be blown away. Until then, Windows users should either install scRUBYt! on Cygwin, install ParseTree somehow, or use 0.2.6 until we are ready with the Ruby bridge to ParseTree, which will make installation on Windows possible without the need to compile C.

Ruby Web Scraping Tool Guide

Hpricot vs Mechanize vs ScrAPI vs Watir vs ScRUBYt! vs…? Which one should you pick to solve your upcoming Web scraping project and why?

Since I have been asked this question several times recently, I will try to answer it here. This article is not about how cool scRUBYt! is and how much the others suck, concluding that you should use scRUBYt! only and nothing else. I will try to be as objective as possible and strive to discuss the more generic features of web scraping tools, rather than talk about scRUBYt! itself. (You can find plenty of that on this blog and by visiting the forum if you are interested.)

First of all, since Web scraping is hard to define precisely (it means something slightly different to everyone, ranging from grabbing a few elements from a page to spidering and crawling whole sites and turning their content into a structured format), I believe it is nearly impossible to answer this question - however, since I am fond of hard-to-solve problems, I will give it a shot nevertheless :-).

There is a plethora of tools that can be used to aid Web scraping in Ruby - this article on web scraping describes most of the important ones. A quick recap of the packages mentioned in this article:

  • The main focus of WWW::Mechanize and (Fire)Watir is simulating web navigation (fetching documents, clicking links, submitting forms etc.). For plain old pages (i.e. no AJAX, JavaScript) Mechanize is a clear winner because of its speed.
  • Hpricot is a handy and extremely fast HTML parser which also offers (pseudo)XPath and CSS selectors, making web scraping possible. Though this is the level where you have the most alternatives (HTML parsers/XPath evaluators), if you would like to stay on this level, I definitely recommend Hpricot.
  • ScrAPI and scRUBYt! are high-level web scraping frameworks, suited for more complex tasks. They try to hide the details of HTML and the lookup of the elements you would like to grab (and in scRUBYt’s case also navigation, like fetching pages, following next links, crawling to new pages) and to figure out your intention, which you feed into them through the provided DSLs.
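
To make the "figure out your intention from an example" idea a bit more tangible, here is a very rough sketch of it. It is not how scRUBYt! or scrAPI actually work internally; it only illustrates the principle of turning one example string into a generalized path and then collecting everything that matches that path. It uses REXML (bundled with Ruby) instead of Hpricot, the page content is made up, and `path_from_example` is a hypothetical helper written just for this illustration.

```ruby
require 'rexml/document'

# A tiny, well-formed stand-in for a scraped page (made-up data).
page = <<~HTML
  <html>
    <body>
      <ul>
        <li>Ruby in a Nutshell</li>
        <li>Programming Ruby</li>
        <li>The Ruby Way</li>
      </ul>
      <p>Programming Ruby is on sale</p>
    </body>
  </html>
HTML

# Hypothetical helper: find the element whose text equals the example,
# then return a generalized path to it (tag names only, no positions).
def path_from_example(doc, example)
  node = REXML::XPath.match(doc, '//*').find { |el| el.text.to_s.strip == example }
  return nil unless node

  names = []
  until node.is_a?(REXML::Document)
    names.unshift(node.name)   # collect tag names while walking up to the root
    node = node.parent
  end
  '/' + names.join('/')
end

doc   = REXML::Document.new(page)
path  = path_from_example(doc, 'Programming Ruby')
books = REXML::XPath.match(doc, path).map { |el| el.text.strip }

puts path    # the generalized path learned from the single example
puts books   # every element on that path, not just the example itself
```

Dropping the positions from the path is what turns "this one element" into "all elements like this one"; a real framework has to be far smarter about when and how much to generalize.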

Your first choice to make is whether you would like to use a low-level tool (usually smaller projects) or a high-level one (more complex scenarios). I believe this problem is as old as the first software package ever written, so it is definitely not a Ruby web scraping tool question in the first place. The dilemma is the following:

  • Using a low-level tool you have maximal freedom: you can do everything as you like and you are not constrained by anyone else’s decisions or design. On the other hand, most probably you will ‘reinvent the wheel’, i.e. waste your time writing code that was already implemented a mozillion times by other people (who potentially have more expertise and/or time to make it better) or fiddle around with low-level quirks.
  • A high-level package offers you a lot of goodies which you can just take and use out of the box - depending on the implementation, possibly allowing you plenty of space to tweak and configure these prefabricated components. However, this is also the downside of this option: if the creator of the tool did not add a feature you need, it’s much harder to change things to your liking than with a low-level package.

Talking about Ruby web scraping tools, the low-level ones give you great freedom to do whatever you want - at the price of much more code and the need to deal with messy HTML and other low-level stuff like different protocols, HTTP connections, and proxies, for example. On the other hand, scRUBYt! can scrape a page in as few as 5-6 lines of code - however, if there is a bug or something is missing/not working properly, you might have a much bigger problem than in the first case (unless somebody fixes it for you, that is). So which is the right choice to take in this situation?

You should consider several parameters to answer this question:

  • Scenario size - If you would like to accomplish something small, it’s possible that the flexibility of a low-level package outweighs the constraints imposed by the heavier frameworks. However, if you would like to do advanced things like spidering a whole site or matching elements based on complicated criteria, you should consider checking out a high-level tool.
  • Amount of low-level stuff - How much of the code will be web scraping and how much pure Ruby would you like to use in between? If you would like to leave yourself plenty of space for customizing and low-level tweaking, and/or you plan to use a lot of Ruby code besides the scraping, a low-level tool is probably better. However, if your code will be mostly (or fully) Web scraping, it’s possible you will be better off with scRUBYt! or scrAPI.
  • The ‘NOW’ factor - In a lot of cases, this point alone is enough to make the decision. Sometimes you just need to do something NOW - and framework X does not contain the feature you cannot live without. In this case you have several options: you can code up the missing functionality yourself (sometimes this is still faster if you would like to use the other powerful features of the framework), or ask the maintainer of the package for this feature and hope he will respond ‘we are just hacking on it’ :-) If neither of these alternatives applies, and you really want to move on, sometimes you just have to skip framework X, no matter how cool it looks and how great it will be once the missing feature is added. You need to accomplish your task NOW…
  • Framework evolution - If you are planning to do a bigger scenario, you should also consider whether the framework you would like to use is an ongoing project, and if yes, what the future plans of its developers are. A ‘live’ framework is always a better bet - during its development, bugs will be fixed, features will be added and most probably some support will be provided.
  • Community and Support - This can also be a deciding factor on its own. Since web scraping can be (read: is) a very tedious process, it is always good to have some support from the community around the framework to help you through the hardest times. I think scRUBYt! already has a great community and good support - check the scRUBYt! forum and see for yourself.
  • The future: maintainability and robustness - Unless you are going to write a throwaway scraper, you should really consider the maintainability issue. Web sites change over time, and they change often. With low-level scraping tools, this brings a lot of extra work: with every modification of the page, you will have to rewrite your scraper. At the moment, scrAPI is the most robust framework from this point of view, since it uses CSS selectors, which are the most resistant to change. scRUBYt! extractors are also easy to maintain - if the page changes, you will have to rewrite just the examples (or if the examples are still there, not even those) and regenerate the extractors, which will work automagically again. In the future we are planning to add CSS selectors to scRUBYt!, as well as different automatic adaptation techniques to be resistant to change.
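
The robustness point in the last item can be sketched with a toy example. This is only an illustration under a few assumptions: it uses REXML (bundled with Ruby) rather than scrAPI or Hpricot, it uses an XPath attribute test as a stand-in for the CSS selector `span.price`, and both page versions as well as the `first_text` helper are made up.

```ruby
require 'rexml/document'

# Two versions of the same made-up page: the redesign adds a banner <div>
# in front of the product block.
old_page = REXML::Document.new(<<~HTML)
  <html><body>
    <div class="product"><span class="price">19.99</span></div>
  </body></html>
HTML

new_page = REXML::Document.new(<<~HTML)
  <html><body>
    <div class="banner"><span>Spring sale!</span></div>
    <div class="product"><span class="price">19.99</span></div>
  </body></html>
HTML

# Hypothetical helper: text of the first node matching an XPath, or nil.
def first_text(doc, xpath)
  node = REXML::XPath.first(doc, xpath)
  node && node.text
end

positional = '/html/body/div[1]/span'   # "the span inside the first div"
by_class   = "//span[@class='price']"   # stand-in for the CSS selector span.price

puts first_text(old_page, positional)   # the price
puts first_text(new_page, positional)   # the banner text: silently wrong
puts first_text(old_page, by_class)     # the price
puts first_text(new_page, by_class)     # still the price
```

The positional path does not even fail loudly after the redesign; it just returns the wrong text, which is exactly the kind of breakage that makes position-based scrapers expensive to maintain.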

The need for navigation (clicking links, submitting forms, crawling to new pages etc.) is also an important factor - if you need it, you will have to either combine several of the above tools, or decide between Mechanize (which now contains Hpricot) and scRUBYt! (built on Mechanize and Hpricot, FireWatir integration coming soon), which are the only tools capable of both web site navigation and some serious scraping (O.K., FireWatir can also do some scraping, but it’s much less powerful than Hpricot’s or scRUBYt!’s).

Every project is different. If you have chosen X for a simple, small project, it does not necessarily mean you will reach for it again when you are crawling a set of sites. Take your time to consider all the requirements and goals you would like to achieve - choosing wisely can save you a lot of time.