Archive for March, 2007

New release: 0.2.6 with heaps of goodies!

We have just released the new version, 0.2.6, with some great new features, tons of bugfixes and a lot of changes overall, which should greatly improve the reliability of the system.

A lot of long-awaited features have been added: most notably, automatic crawling to the detail pages, which has been the most requested feature in scRUBYt!’s history. I will add a tutorial and a detailed example on how to use this feature, which enables you to easily crawl a whole site.

Another great addition is the improved example generation - you don’t have to use the whole text of the element you would like to match anymore - it is enough to specify a substring, and the first element that contains the string will be returned. Moreover, you can also use regular expressions, in which case the first element with text content matching the regexp will be returned. If this is still not enough, it is possible to create a compound example like this:

flight :begins_with => 'Arrival', :contains => /\d{4}/, :ends_with => '20:00'
I guess it’s quite intuitive how this should work.
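
To make the matching rules concrete, here is a minimal pure-Ruby sketch of how such examples could be tested against candidate texts. It is not scRUBYt!’s actual implementation: the helper names `example_matches?` and `first_match` are hypothetical, and it assumes that all constraints of a compound example must hold at the same time.

```ruby
# Hypothetical sketch, not scRUBYt!'s own code.
# Returns true if +text+ satisfies +example+, which may be a String
# (substring match), a Regexp, or a Hash of constraints (compound example).
def example_matches?(text, example)
  case example
  when String then text.include?(example)
  when Regexp then !!(text =~ example)
  when Hash
    # Assumption: every constraint in the hash has to be satisfied.
    example.all? do |constraint, value|
      case constraint
      when :begins_with then text.start_with?(value)
      when :ends_with   then text.end_with?(value)
      when :contains    then example_matches?(text, value)
      else raise ArgumentError, "unknown constraint: #{constraint.inspect}"
      end
    end
  else
    raise ArgumentError, "unsupported example: #{example.inspect}"
  end
end

# Returns the first text that matches the example, or nil.
def first_match(texts, example)
  texts.find { |text| example_matches?(text, example) }
end

flights = [
  "Departure 2007 gate 4 at 20:00",
  "Arrival gate 12 at 20:00",
  "Arrival 2007 gate 12 at 20:00"
]
compound = { :begins_with => "Arrival", :contains => /\d{4}/, :ends_with => "20:00" }

puts first_match(flights, compound)  # prints "Arrival 2007 gate 12 at 20:00"
```

The second entry is skipped because it contains no four-digit run, even though it begins and ends with the right strings.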

We have fixed an enormous number of bugs and tested the whole system thoroughly, so the overall reliability should be much better than in previous releases.

If you have any comments, questions, suggestions, please visit the brand new forum!

Last but not least, the full changelog:

* [NEW] Automatically crawling to and extracting from detail pages
* [NEW] Compound example specification: So far the example of a pattern
   had to be a string. Now it can be a hash as well, like 
  {:contains => /\d\d-\d/, :begins_with => 'Telephone'}
* [NEW] More sophisticated example specification: regexps can be used as
   well, and it is no longer necessary (though still possible, of course) to
   specify the whole content of the node - nodes that contain the string/match
   the regexp will be returned, too
* [NEW] Possibility to force writing text in case of non-leaf nodes
* [NEW] Crawling to the next page now possible via image links as well
* [NEW] Possibility to define examples for any pattern (before it did not 
  make sense for ancestors)
* [NEW] Implementation of crawling to the next page with different methods
* [NEW] Heuristics: if something ends with _url, it is a shortcut for:
        some_url 'href', :type => :attribute
* [FIX] Crawling to the next page (the broken google example): if the next
        link text is not an <a>, traverse down until the <a> is found; if it is
        still not found, traverse up until it is found
* [FIX] Crawling to next pages does not break if the next link is greyed out
        (or otherwise present but has no href attribute) (Credit: Robert Au)
* [FIX] DRY-ed next link lookup - it should be much more robust now as it
   uses the 'standard' example lookup
* [NEW] Correct exporting of detail page extractors
* [NEW] Added more powerful XPath regexp (Credit: Karol Hosiawa)
* [NEW] New examples for the new features
* [FIX] Tons of bugfixes, new blackbox and unit tests, refactoring and 
   stabilization
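
As an illustration of the next-link [FIX] entries above, the lookup strategy (search downwards for an <a>, then upwards, and stop when there is no href) could be sketched like this. This is only a sketch under stated assumptions: `Node`, `node`, `find_anchor_below`, `find_anchor_above` and `resolve_next_link` are hypothetical names, and a tiny Struct stands in for the real parsed document.

```ruby
# Hypothetical stand-in for a DOM node; not an Hpricot or scRUBYt! class.
Node = Struct.new(:name, :attrs, :children, :parent) do
  # Attaches +child+ below this node and returns the child.
  def add(child)
    child.parent = self
    children << child
    child
  end
end

def node(name, attrs = {})
  Node.new(name, attrs, [], nil)
end

# Depth-first search for the first <a> at or below +n+.
def find_anchor_below(n)
  return n if n.name == "a"
  n.children.each do |child|
    found = find_anchor_below(child)
    return found if found
  end
  nil
end

# Nearest <a> among the ancestors of +n+, or nil.
def find_anchor_above(n)
  n = n.parent
  n = n.parent while n && n.name != "a"
  n
end

# Returns the href to follow, or nil when no usable link exists
# (for example a greyed-out "next" element without an href).
def resolve_next_link(n)
  anchor = find_anchor_below(n) || find_anchor_above(n)
  anchor && anchor.attrs["href"]
end

# <a href="/page2"><span>Next</span></a>: the matched element is the <span>.
span = node("a", { "href" => "/page2" }).add(node("span"))
puts resolve_next_link(span)

# <td><a href="/page3"><img/></a></td>: the matched element is the <td>.
cell = node("td")
cell.add(node("a", { "href" => "/page3" })).add(node("img"))
puts resolve_next_link(cell)
```

Returning nil instead of raising is what lets a crawl end quietly on the last page in this sketch.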

Please keep the feedback coming - your contributions are a key factor in scRUBYt!’s success. This is not an exaggeration or a feeble attempt at flattery - since we (obviously) cannot test everything on every possible page, we can make scRUBYt! truly powerful only if you send us all the quirks and problems you encounter during scraping, as well as your suggestions and ideas. Thanks everyone!

Forum

Click here to visit the forum!

scRUBYt! Forum Goes Live

Since several people have requested a forum for better information exchange, I have installed Beast to enhance the discussion. After all, talking via WordPress comments and direct emails is not really cool :-). Please find the brand new forum here:

http://agora.scrubyt.org/

If you have any problems, errors or suggestions, please report them here (or at the forum, if the problem is anything other than ‘the forum does not work’ :-)). Enjoy!

Really Getting scRUBYt! up and Running on Ubuntu

Here we go again: after so many posts and discussions, it seems there are still plenty of problems with installing scRUBYt! on Ubuntu that were not addressed even by all the articles that appeared here. After some scRUBYt! users hunted down even more of them, I will try to sum up a HOWTO that really works:

sudo apt-get install build-essential ruby ri rdoc ruby1.8-dev libopenssl-ruby1.8

Important! - make sure you have installed libopenssl-ruby1.8 - and not libopenssl-ruby for example.

If you have installed all of this, you can follow the standard installation to install mechanize, hpricot and scRUBYt! (hint: it’s just sudo gem install-ing all three of them).

If you know of a problem with the installation, or something is still not working even though you have installed everything according to this tutorial, please let me know!

Problems on Ubuntu, vol. 2

This problem is not directly scRUBYt! related, but since several people have asked about it, I guess it won’t hurt anybody if I post a short note on it.

The error message goes something like this:

Building native extensions. This could take a while...
extconf.rb:1:in `require': no such file to load -- mkmf (LoadError)
    from extconf.rb:1

ERROR: While executing gem ... (RuntimeError)
ERROR: Failed to build gem native extension.
Gem files will remain installed in /usr/lib/ruby/gems/1.8/gems/hpricot-0.5 for inspection.

In this case, on Ubuntu the problem is that ruby1.8-dev is not installed. Install it with the following command:

sudo apt-get install ruby1.8-dev

I know about Debian’s package policy, but I would still be much happier if issuing a simple

sudo apt-get install ruby

would install all the packages (ruby, rdoc, ri, ruby-dev, etc.) - well, this will probably be addressed somehow in the future. Until then, be sure to install ruby1.8-dev!
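
If you want to check for this before running gem install, a small script along these lines can tell you whether mkmf.rb is visible to your Ruby. It is only a sketch: `library_on_load_path?` is a hypothetical helper, and it merely looks for the file on the load path instead of loading it, so it proves nothing about the rest of the build setup.

```ruby
# Hypothetical helper: true if "<name>.rb" exists in one of the given
# load path directories. It does not require (execute) the library.
def library_on_load_path?(name, load_path = $LOAD_PATH)
  load_path.any? { |dir| File.exist?(File.join(dir.to_s, "#{name}.rb")) }
end

if library_on_load_path?("mkmf")
  puts "mkmf found - native extensions should at least start building."
else
  puts "mkmf is missing - on Ubuntu, try: sudo apt-get install ruby1.8-dev"
end
```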

‘Learning’ and ‘Production’ extractors (with a reddit example)

I have mentioned the differences between learning and production extractors here and there in the tutorials, but it seems I either did not explain them clearly or put the information somewhere not very visible. In any case, here it is again - it seems this cannot be emphasized enough.

When you are scraping a page for the first time, you are constructing a learning extractor. The purpose of a learning extractor is to show the system what you would like to extract - and nothing more, really. Once you have shown scRUBYt! what to extract, you can forget about the examples, and about the whole learning extractor altogether. So the answer to the really frequently asked question ‘This is nice and all, but what should I do if the page changes, and the examples are not there anymore?’ is: nothing. scRUBYt! takes care of this for you.

Let’s see an example of this in practice. Let’s say I would like to scrape the titles of the articles on reddit, so I create a simple extractor (we could extract the number of votes, comments or whatever else, but for the sake of illustration let’s keep this simple):

require "rubygems"
require "scrubyt"

reddit_data = Scrubyt::Extractor.define do
  fetch "http://reddit.com"

  article_title("I am not a state secret").select_indices([:first,:every_third])
end

reddit_data.to_xml.write($stdout, 1)
Scrubyt::ResultDumper.print_statistics(reddit_data)

The result should be something like:

[MODE] Learning
[ACTION] fetching document: http://reddit.com
Extraction finished succesfully!

  <root>
    <article_title>I am not a state secret</article_title>
    <article_title>A Hundred couples having sex in one room</article_title>
    <article_title>I&#39;m A Mac User And I Love This Article Tearing Mac Users A Collective New One</article_title>
    ...
    ...
    other titles etc.
    <article_title>(video) Daily Show: Bigotry at its American Best... Texans respond to a proposed Mosque in their Neighborhood..</article_title>
  </root>

    article_title extracted 25 instances.

Now comes the usual question: ‘OK, but what if the article “I am not a state secret” is no longer featured on the front page - the extractor would break!!!’.

Well, this is where the production extractor comes into play: once scRUBYt! has ‘understood’ what you would like to scrape, and you think it has understood it well (if not, you have to tweak the scraper further), it is time to turn the learning extractor into a production one. How? (This is actually the second most frequently asked question, right after the first one is answered.) Well, believe it or not, this is very easy, and takes one line of scRUBYt! code:

reddit_data.export(__FILE__)

This was not hard, was it? Now, you should have a production extractor, named

reddit_data_extractor_export.rb

and it should contain something like this:

require 'rubygems'
require 'scrubyt'

reddit_data = Scrubyt::Extractor.define do
  fetch 'http://reddit.com'

   (article_title "/html/body/div/table/tr/td/a").select_indices([:first,:every_third])
end

reddit_data.to_xml.write($stdout, 1)

You see? The example is gone, and you will be able to run this extractor anytime, regardless of the content of the page.
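
To illustrate the idea behind the two modes, here is a small self-contained sketch that uses REXML on two made-up page snapshots instead of scRUBYt! itself. It does not show how scRUBYt! actually derives its XPaths: `generalized_path`, `learn_path` and `extract` are hypothetical helpers, and the markup is invented for the demonstration.

```ruby
require "rexml/document"

# Hypothetical helper: builds an index-free absolute path such as
# "/html/body/table/tr/td/a" for the given element.
def generalized_path(element)
  names = []
  node = element
  until node.nil? || node.is_a?(REXML::Document)
    names.unshift(node.name)
    node = node.parent
  end
  "/" + names.join("/")
end

# "Learning": find the first element whose own text contains the example
# and keep only its path. Returns nil if the example is not on the page.
def learn_path(xml, example)
  doc = REXML::Document.new(xml)
  hit = REXML::XPath.match(doc, "//*").find { |el| el.text.to_s.include?(example) }
  hit && generalized_path(hit)
end

# "Production": apply a stored path to whatever the page contains today.
def extract(xml, path)
  REXML::XPath.match(REXML::Document.new(xml), path).map { |el| el.text.to_s.strip }
end

# Invented snapshots of the same page on two different days.
day_one = <<~XML
  <html><body><table>
    <tr><td><a>I am not a state secret</a></td></tr>
    <tr><td><a>Second headline</a></td></tr>
  </table></body></html>
XML

day_two = <<~XML
  <html><body><table>
    <tr><td><a>Fresh story</a></td></tr>
    <tr><td><a>Another fresh story</a></td></tr>
  </table></body></html>
XML

path = learn_path(day_one, "I am not a state secret")
puts path
puts extract(day_two, path).inspect
```

The example text is needed only once, to obtain the path. The second snapshot no longer contains it, yet the stored path still selects the new titles.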

I hope this makes things a bit clearer now…

Can Somebody Translate this (from Japanese)?

A few days ago I spotted an interesting-looking URL among my incoming links:

http://www.rubyist.net/~matz/


It looks like this is Matz’s homepage, so I would be really happy if somebody could translate those 3 lines into English for me. The full link to the lines in question is here:

http://www.rubyist.net/~matz/20070222.html#p04


Thanks in advance!