scRUBYt!
WWW::Mechanize and Hpricot on Steroids

Briefly...
scRUBYt! is a simple to learn and use, yet powerful web scraping toolkit written in Ruby. The idea behind making scRUBYt! was to show a few simple concepts of Web extraction as a practical extension of this tutorial.
April 26th, 2007 at 5:18 am
Hi,
I’m new to Ruby and web scraping, and I was wondering if scRUBYt! can be used to achieve the same result as the future commercial tool, Dapper ( www.dapper.net ).
I’m really interested in the subject and I’m looking forward to your replies,
titel
April 26th, 2007 at 6:39 am
Well, this is kind of apples to oranges. scRUBYt! has no GUI whatsoever, while dapper’s primary strength is in its GUI. On the other side, scRUBYt! can do other, sometimes more powerful things than dapper, and if this is still not enough, you can use pure Ruby inside any of the extractors.
The second great difference is that scRUBYt! is very poor on the output at the moment (it can just output XML, CSV and hash map), whereas dapper’s second strength besides its GUI lies exactly in the diverse set of output functions.
The comparison is also kind of hard because dapper is web based, and scRUBYt! is not (at the moment - among the plans is adding a Rails web frontend).
Of course there is also the difference between open and closed source software. scRUBYt! is open to ideas, patches and other contributions from everybody. Also the extractors generated by it are free to use in any way the user likes. Not so with dapper: you do not even see the extractors, they are stored and executed solely on their servers (well, this could also be viewed as a plus, since you need not worry about installing, configuring, providing resources etc.)
In my experience, dapper has serious problems with extracting the stuff precisely and fully correctly - it gets it right in 95% of the cases but it is very hard to do something about the 5%. In my opinion the algorithms behind scRUBYt! are already better, and they will be improved a lot in the future.
To sum it up: scRUBYt! is quite immature at the moment and cannot compete with OpenKapow, Dapper, teqlo and other products (well, in some features it’s better than those, but overall it is still too young and weak ATM).
However, once it has a web frontend, a lot of the missing features are added and the whole thing is stabilized, I think it will be a serious rival to dapper.
If you would like to continue the discussion or have questions, please join the forum!
April 27th, 2007 at 9:09 am
First of all, I want to thank you for your fast reply.
I must admit that after I read your reply I realized how imprecise my question was.
I was certainly aware of the differences (open vs. closed source, the fact that it is not web based, etc.)
titel
May 8th, 2007 at 5:48 pm
Greetings,
I can’t seem to sign up for your forums (no email received), so I’ll throw this out there… Why doesn’t this work?
http://pastie.caboo.se/60032
The basic idea is that I want to restrict the scraping to a subsection defined by a div that has an ‘id’ of ‘results’, and then build a match of the things underneath that. If there’s a better way, I haven’t been able to find it in any of the documentation or examples… I know the XPath provided works in Hpricot, as I’ve tested that too. ( http://pastie.caboo.se/60036 )
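For readers who cannot open the pastes, the restriction described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses REXML instead of Hpricot or scRUBYt!'s own extractor syntax, and the markup, the `result_links` helper and the link paths are all made up.

```ruby
require 'rexml/document'

# Hypothetical page: an ad block followed by the real results container.
PAGE = <<~XML
  <html>
    <body>
      <div class="ads"><a href="/ad1">Sponsored item</a></div>
      <div id="results">
        <a href="/item/1">First item</a>
        <a href="/item/2">Second item</a>
      </div>
    </body>
  </html>
XML

# Hypothetical helper: collect only the links underneath the div whose
# id is 'results', ignoring anything outside it (such as the ads).
def result_links(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, "//div[@id='results']//a").map do |a|
    { title: a.text, href: a.attributes['href'] }
  end
end

links = result_links(PAGE)
links.each { |link| puts "#{link[:title]} -> #{link[:href]}" }
```

Because the query is anchored on the ‘results’ id rather than on a position in the page, it gives the same answer whether or not the ad block is present.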
First off, it dies with a ‘nil’ ref in the ‘zip’ method. You can fix that by adding ‘.to_a’ to the last expression on line 130 (135 in 0.3.0) of xpathutils.rb. Once that’s fixed, you find that it only returns one result.
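A minimal sketch of why adding ‘.to_a’ avoids the crash. This is not the actual xpathutils.rb code; `find_matches` and `pair_up` are hypothetical stand-ins. In plain Ruby, calling `zip` on nil raises NoMethodError and passing nil to `zip` raises TypeError, while `nil.to_a` is simply an empty array.

```ruby
# Hypothetical stand-in for a lookup that yields nil when nothing matches.
def find_matches(found)
  found ? %w[a b] : nil
end

# Hypothetical helper: normalise both sides with to_a before zipping,
# so a nil on either side becomes [] instead of raising.
def pair_up(left, right)
  left.to_a.zip(right.to_a)
end

both_present = pair_up(%w[x y], find_matches(true))
nil_argument = pair_up(%w[x y], find_matches(false))
nil_receiver = pair_up(find_matches(false), %w[x y])

p both_present
p nil_argument
p nil_receiver
```

Note that this only stops the exception. A nil on one side still produces nil-padded or empty pairs, so it would not by itself explain or fix getting a single result back.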
Plus, if you uncomment the exporter, it fails with a nil someplace deep in ruby2ruby, called from export.
If I strip the xpath search off the record invocation, it works, but gets the ads at the top as well.
Since the ads are sometimes not there (dependent on keyword), I can’t just refer to a specific div…
Ideas, suggestions, thoughts on when Agora will be sending email again?
In general, it’s a GREAT scraping system, btw! Speaking as someone who’s had an open source app that’s been scraping eBay for the last 7 years, this is a really, really nice implementation! You can turn around a scraper really fast with this…
– Morgan Schweers, CyberFOX!