From Automation to Curation: scRUBYt.org Reinvented as a Web Directory Hub

Posted by admin

You can download my EURUKO (the European Ruby Conference) 2007 slides from here. Enjoy!

Originally built as a Ruby-based web scraping toolkit, scRUBYt! helped developers automate data extraction from websites with simple yet powerful syntax. It gained recognition during early Ruby conferences such as EURUKO 2007 and was used for scraping Google results, price comparison sites like PriceGrabber, and even ranking Ruby blogs using Alexa and Technorati data.

As the digital landscape evolved, so did our vision. While automation was once the core, we realized that curation is the new automation. Users today seek trusted sources — not just raw data, but high-quality links to websites worth visiting.

That’s why scRUBYt.org is being reinvented.

We’re now shifting focus from scraping to smart web link discovery and recommendation. Instead of collecting data behind the scenes, we now highlight popular, trending, and valuable websites across categories like tech, entertainment, games, communities, live sports, and more.

Our mission? To help users find what matters — faster, safer, and without the noise.

Posted by admin on Nov 13 2007 under News & Announcements | 3 Comments » |

Announcing JscRUBYt! – no more win32 problems (?)

Posted by admin

Thanks to Paul Nikitochkin a.k.a. pftg, scRUBYt! has made a great leap towards win32 compatibility. Paul created JscRUBYt! – the JRuby version of scRUBYt!, which should be easy to install under win32 even if you are not a level 64 Microsoft compiling ninja (in fact, it requires no compiling, no fiddling around with C/C++ and nothing outside (J)Ruby-land, except for installing JRuby, of course).

Please download JscRUBYt! from here and read the installation instructions written up by Paul.

Please let us know if you run into any problems, and tell us about your experience using this package!

Posted by admin on Oct 02 2007 under News & Announcements | 9 Comments » |

A Hot New Release, 0.3.4 is Out – What’s New?

Posted by admin

After a long-long time, a lot of bugfixes, brainstorming sessions, coding, coding, coding, cans of red bull and coding, we are proud to present scRUBYt! 0.3.4!

Judging from the posts on the forum, people are not aware of quite a lot of powerful features (which is mainly my fault, as I was too lazy to write any documentation for the last 2 releases – but a cheatsheet and reference are on the way) – so I’d like to introduce a few new features which were added to scRUBYt! 0.3.4 to avoid this, at least for this release.

First of all there are 3 new pattern types, of which 2 are particularly interesting. Let’s start with the not-so-interesting one:

Constant pattern:
```
      pattern 'some constant text', :type => :constant
    
```
Sometimes I needed a piece of text or data which was not contained in the web page (or it was always constant, so scraping it would mean an unneeded overhead) – perhaps a comment or a required field in a feed or other predefined schema. The constant pattern comes in handy in exactly these cases: the above example will produce:
```
some constant text
   
```

The two interesting ones are in a very-alpha stage (in fact one of them was implemented 2 days ago for a scenario) so they are more of a preview of what to expect in the future releases than full-fledged features. They are already usable to some extent, but a lot of tweaking, polishing and adding new functionality can be expected in the near future.

Text pattern:
```
      pattern 'td[some text]:all', :type => :text
    
```
A text pattern works differently than an XPath one: while the XPath pattern relies on the structure of the document, the text pattern doesn’t. This is essential in the case of some sites (the most typical example is perhaps Wikipedia) which do not use a single template to present the content and/or whose structure changes often, but which have some text labels or other constant text chunks that can aid the scraping. The semantics of the above example are:
```
Find all <td> tags which contain the text 'some text', wherever on the page.
```
I am sure you noticed the :all notation – currently :index (where index is a number, so :0, :1 etc.) is supported besides :all, meaning ‘give me the first (:0), second (:1) etc. occurrence of the match’.
A lot of additions can be expected for the text pattern in the future (for example: give me the longest text in a <td>, or give me <td>s matching a certain regexp etc.). As always, suggestions are warmly welcome!
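
To make the semantics a bit more tangible, here is a rough plain-Ruby sketch of what a text pattern such as 'td[some text]:all' asks for. This is not how scRUBYt! implements it: the helper name find_by_text and the naive regexp-based tag matching are assumptions made purely for illustration, and a real implementation would use a proper HTML parser.

```ruby
# Hypothetical helper, not part of scRUBYt!: returns the contents of every
# <tag> element that contains `text`, or only the n-th match when `which`
# is an integer (0-based, mirroring the :0, :1 ... notation).
def find_by_text(html, tag, text, which = :all)
  # Naive regexp matching; good enough for flat, well-formed snippets only.
  cells = html.scan(/<#{tag}\b[^>]*>(.*?)<\/#{tag}>/mi).flatten
  matches = cells.select { |cell| cell.include?(text) }
  which == :all ? matches : matches[which]
end

html = <<~HTML
  <table>
    <tr><td>some text here</td><td>unrelated</td></tr>
    <tr><td>more some text</td></tr>
  </table>
HTML

all_matches  = find_by_text(html, 'td', 'some text')     # like 'td[some text]:all'
second_match = find_by_text(html, 'td', 'some text', 1)  # like 'td[some text]:1'
puts all_matches.inspect
puts second_match
```

Note that the position of the cells in the table plays no role here; only the tag name and the contained text matter, which is the whole point of the text pattern.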

Script pattern:
```
    pattern lambda {|x| x.gsub('x','y').downcase}
  
```
A script pattern is a way to execute an arbitrary Ruby block during scraping. Its input, as always, is the output of its parent pattern, represented by ‘x’.
While this pattern type will be enhanced a lot in the future (allowing more, arbitrary patterns to be chosen as the input, the possibility of specifying custom input, simplifying the syntax (the ‘lambda {|x| }’ stuff is constant so it will most probably be dropped) etc.), this pattern is already quite powerful as it is. The simplest use cases include filtering and modifying URLs, stripping white space or other string modifications like substitutions on the result, primitive branching etc. However, only your imagination is the limit here: you could do different operations on scraped prices or stock data, or run scraped coordinates through a geocoder. I am quite sure that the script pattern will be a lot of fun, resulting in interesting uses.
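
Since a script pattern is just a Ruby lambda applied to the parent's result, the idea can be tried without scRUBYt! at all. The following is a small sketch under that assumption; the sample strings and the lambda names are made up for illustration.

```ruby
# Stand-ins for the output of a parent pattern (scraped strings).
scraped_url   = '  HTTP://Example.COM/Index.html  '
scraped_price = '$649.95'

# Each lambda receives the parent's result as 'x', like a script pattern.
strip_and_downcase = lambda { |x| x.strip.downcase }
price_to_number    = lambda { |x| x.delete('$,').to_f }
swap_letters       = lambda { |x| x.gsub('x', 'y').downcase } # the example above

cleaned_url = strip_and_downcase.call(scraped_url)
price       = price_to_number.call(scraped_price)
swapped     = swap_letters.call('XaxB')

puts cleaned_url
puts price
puts swapped
```

Keeping each transformation in its own small lambda makes it easy to test the clean-up logic separately from the scraping itself.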

There are some additions to the output functionality: to_hash now accepts a custom delimiter (for the cases when the output contained a comma, the default delimiter) and there is a new method: to_flat_xml, which produces a feed-like, flat xml instead of the hierarchical output generated by to_xml.

Logging was reworked completely by Tim Fletcher. The most notable difference is that by default, you won’t be flooded with all the debug messages pouring from scRUBYt!. To enable logging, you have to explicitly add a line before your extractor:

Scrubyt.logger = Scrubyt::Logger.new
#your extractor begins here

Last but not least, a lot of bugs were fixed: the infamous regexp pattern bug, the encoding bug (scraping utf-8 pages should be ok now), a lot of fixes in the download pattern and other places.

jscRUBYt! and firescRUBYt! are on the way, so stay tuned!

Posted by admin on Sep 27 2007 under News & Announcements | 7 Comments » |

Let the Dogs Out!

Posted by CopperMonkey

scRUBYt! has dug its way into the investment business :). No joke – check out this scrummy tutorial created by Doug Bromley to find out more.

A refreshing mix of business and technology is served here: a nice scraper that returns dividend yields summarized all in one place and a useful example for form filling and submitting, page navigation and constraints.

Thanks, Doug, for the excellent job!

Posted by CopperMonkey on Sep 19 2007 under News & Announcements | 2 Comments » |

Scrapin’ Google in no sec

Posted by CopperMonkey

Google is back on stage! I have dusted off one of the best-known examples in scRUBYt!’s history and wrapped it in a brand new coat. Well, here it is, hot off the frying pan. If you are a scRUBYt! newbie this is a perfect place to start to ‘scrub’ around.

Example of:

filling and submitting a textfield
extracting and using a href attribute
recursively crawling to the next page(s)

Goal:

Go to google.com. Enter ‘ruby’ into the search textfield and submit the form. Extract the result URLs from the first 2 pages.

Solution:

Use the navigational commands ‘fetch’, ‘fill_textfield’ and ‘submit’ to navigate to the page of interest. There, extract the links with the pattern ‘link’. The URLs should be extracted with the child pattern of ‘link’, ‘url’, which is an attribute pattern. This alone will extract the first 10 results, but you need the first 20. To achieve this, use the ‘next_page’ idiom with ‘:limit’ set to 2.

Check out the code:

require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do
  #Perform the action(s)
  fetch 'http://www.google.com/ncr'
  fill_textfield 'q', 'ruby'
  submit
  #Construct the wrapper
  link "Ruby Programming Language" do
    url "href", :type => :attribute
  end
  next_page "Next", :limit => 2
end

puts google_data.to_xml

Check out the result here!
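
If the effect of ‘:limit’ is not obvious, the following plain-Ruby sketch simulates the idea of following a “Next” link for a limited number of pages. It does not use scRUBYt! or the network: the PAGES hash, its example URLs and the crawl helper are all hypothetical, and I am assuming that the starting page counts towards the limit (which fits the “first 20 results with :limit => 2” description above).

```ruby
# Hypothetical in-memory "site": each page has some result links and,
# optionally, the key of the page its "Next" link points to.
PAGES = {
  'page1' => { links: ['http://a.example/', 'http://b.example/'], next: 'page2' },
  'page2' => { links: ['http://c.example/'],                      next: 'page3' },
  'page3' => { links: ['http://d.example/'],                      next: nil }
}

# Collect links starting at `start`, visiting at most `limit` pages
# (the starting page counts as the first one).
def crawl(pages, start, limit)
  collected = []
  current = start
  visited = 0
  while current && visited < limit
    page = pages.fetch(current)
    collected.concat(page[:links])
    current = page[:next]
    visited += 1
  end
  collected
end

urls = crawl(PAGES, 'page1', 2)
puts urls
```

With a limit of 2 the third page is never visited, even though a “Next” link to it exists.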

Finally, if you are a visual type, check out this overview diagram:

Something not clear? Check out our tiny FAQ!

‘What’s ‘q’ supposed to mean?’

‘q’ is the name of the Google search box. How did I figure it out? Well, you can search the source code back and forth to find it, but I guess using a tool like XPather, for example, is far easier.

XPather?!?…never heard of it

If you are not familiar with XPather you should check out either this quick kick-off tutorial (recommended for smaller appetites) or the full user guide (just to tease your taste buds).
As an extra topping here is a yamoo cheat sheet.

To install XPather go here (I assume you are using Firefox: 1.5.0.* – 2.0.0.*. If not, sorry folks, you have to come up with another alternative).

Still confused? This visual step-by-step ‘how-to’ may help…

Using this tool is actually very easy (that is one of the reasons it is recommended by the team). Give it a try and you’ll get the hang of it. Here is a little example as a demonstration (let’s assume you have DOM Inspector in your Firefox):

All you need to do is

1. go to www.google.com
2. launch the DOM Inspector (you can find it under the Tools menu in any Mozilla window)
3. click on the first icon in the top row (see snapshot)

4. now click on the Google search box (if you did it right a red frame should appear around the box and blink several times)
5. go back to the DOM Inspector and enjoy the result

For those who like it to be visualized…(click on the image to enlarge it)

Posted by CopperMonkey on Aug 14 2007 under News & Announcements | 25 Comments » |

Scraping pricegrabber.com with scRUBYt!

Posted by CopperMonkey

This is a new series on scraping different (hopefully typical) web pages to show off various scRUBYt! techniques. We will start off with a pricegrabber.com extractor. Check out the big picture to see what we will accomplish today!

You will see an example of:

filling and submitting a textfield
crawling to the detail pages of the results
downloading images
scraping desired data (text, price, user rating)

Goal:

Go to http://cameras.pricegrabber.com.
Enter ‘canon EOS 20D’ into the search textfield and submit the form.
Download the item image from the detail page(s) and additional product information (like the lowest price) from the ‘Product Details’ tab.

Solution:

Use the navigational commands ‘fetch’, ‘fill_textfield’ and ‘submit’ to navigate to the list page containing items related to the search term ‘canon EOS 20D’.
Crawl to the detail page by “clicking on” the link Canon EOS 20D Digital SLR Camera Body Only.
Choose the detail tab ‘Product Details’ by clicking on it.
Download the product image by providing its location, extracting its ‘src’ attribute and finally setting a directory to download the images into using a ‘download’ pattern (note: you do not need to create the given directory, scRUBYt! does it for you! Look for it in the same location where your extractor is stored).
Scrape the desired data (in this case Description and Lowest Price).
Extract user rating with a regular expression.

The code:

require 'rubygems'
require 'scrubyt'

pricegrabber_data = Scrubyt::Extractor.define do

  fetch 'http://cameras.pricegrabber.com'
  fill_textfield 'form_keyword', 'canon EOS 20D'
  submit

  camera 'Canon EOS 20D Digital SLR Camera Body Only', :generalize => false do
    camera_detail do
      detail_tab 'Product Details', :generalize => false do
        detail_detail do
          camera_record :generalize => false do
            image "http://ai.pricegrabber.com/pi/0/37/48/3748716_125.jpg" do
              image_url "src", :type => :attribute do
                download 'camera_images', :type => :download
              end
            end
            name 'EOS 20D Digital SLR Camera Body Only'
            lowestprice '$649.95'
            rating_and_reviews :contains => '(Read 31 Reviews)' do
              rating /(\d.\d\d \/ \d.\d\d)/
            end
          end
        end
      end
    end
  end
end

puts pricegrabber_data.to_xml

If you’d like to run it, cut’n’paste it from here (otherwise the code will be messed up).

If you are more of a visual type, check out this chart depicting the process.
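
The regular expression step from the solution can also be tried in isolation. Here is a minimal sketch in plain Ruby, assuming the scraped element text looks roughly like the sample string below (the sample is made up; only the ‘(Read 31 Reviews)’ part comes from the extractor).

```ruby
# Hypothetical text of the 'rating_and_reviews' element.
sample = 'User Rating: 4.52 / 5.00 (Read 31 Reviews)'

# One digit, a dot and two digits on each side of the slash.
rating_pattern = /(\d\.\d\d \/ \d\.\d\d)/

# String#[] with a regexp and a group number returns the captured text,
# or nil when the pattern does not match.
rating = sample[rating_pattern, 1]
puts rating
```

If the page ever shows the rating in a different format, the expression simply returns nil instead of raising, so a missing rating does not break the rest of the record.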

Troubleshooting

Problem: “This extractor does not work!”

Solution: ‘Lowest Price’ changes quite frequently – check if the actual ‘Lowest Price’ at http://cameras.pricegrabber.com/digital/canon/m/3748716/details/ matches the one stated in the extractor. If not, simply “upgrade” the example with the actual correct data. Hmmm, this should work.

Problem: “The extractor returns data for the first listed item only. Where is the rest of data?”

Solution: The extractor was designed only to give a taste of how to solve similar scraping problems, as stated in ‘Goal’, therefore it is limited to the first item only. If you would like to scrape all the items on the list page, simply remove :generalize => false from
“camera ‘Canon EOS 20D Digital SLR Camera Body Only’, :generalize => false do”. However, it will return data from the first list page only, since no crawling to the next pages is defined in the scraper (in our case it would be kind of insane since no next pages exist for this type of camera).

Posted by CopperMonkey on Jul 24 2007 under News & Announcements | 98 Comments » |

New things on the block

Posted by CopperMonkey

Hey all,

I hope things around scRUBYt! are going to boil again :-). I am happy to announce that the community has started a new series of step-by-step tutorials! The first piece has just hit the road, be sure to check it out. According to my estimate it hits the intermediate level so beginners don’t panic if you feel lost. This is just a pilot run, in the next tutorials all the basics will be covered.

I am eager to hear your feedback about the usability, design, tips for improvement, whatever. Contributions of any kind are warmly welcome!

I would like to announce that the scRUBYt! bug tracker is up and running – please send your bug reports, suggestions, feature requests etc. there (it’s a publicly accessible Lighthouse account). If you are having any issues, please let us know!

CopperMonkey

Posted by CopperMonkey on Jul 24 2007 under News & Announcements | 1 Comment » |

Ranking Ruby/Rails Blogs with scRUBYt! Using Alexa and Technorati

Posted by admin

As a (positive) side-effect of Ruby’s and RoR’s current popularity, we can witness the mushrooming of Ruby/Rails blogs of different size, shape, color and quality. While this is a great thing in general, it is easy to get flooded with all this information pouring in from every side – and much harder to pick a smaller set which you can monitor in the time you devote to reading Ruby blogs.

Fortunately there are some great Ruby/Rails blog aggregators like http://rubycorner.com/, http://www.planetrubyonrails.org/ – not to be confused with http://www.planetrubyonrails.com/ (here is why) or http://planet.caboo.se/.
The problem is that it’s clumsy to monitor all these aggregators – they contain hundreds of blogs each with a lot of overlapping.

This issue was partially solved by this Ruby on Rails Blog Aggregator Dupe Prevention yahoo pipe. However, there is a fundamental problem with that: You still have several hundreds of blogs – nearly 500 of them, and new ones coming out day by day. If you have time to read all these periodically, all the power to you – however, the rest of us would need a way to pick the top n ones in order to be able to cope with their reading.

I am proposing a solution for this problem, by performing these steps:

1. Collecting all the blog URLs from the above 4 aggregators
2. Performing some data cleaning (e.g. http://www.site.com, http://site.com, http://site.com/blog and other variants, with and without trailing slashes, point to the same site) and dupe removal
3. Querying http://www.alexa.com and http://www.technorati.com for the popularity of each of the blogs on the list. This already generates two rankings: “Alexa top 30” and “Technorati top 30”. Alexa required some intervention by hand (more on that later).
4. Putting it all together: based on the two top 30 lists, generate a final, all-in-one Ruby/Rails blog top 10!

Let’s see an overview of the process – then I will explain each step in detail!

Let’s begin with scraping http://www.planetrubyonrails.com/. This extractor was the easiest to construct of all the four:

    planet_ror_com_data = Scrubyt::Extractor.define do
      fetch("http://www.planetrubyonrails.com/pages/channels")

      link "http://weblog.jamisbuck.org/"
    end

This wasn’t too hard, was it? We have just extracted all the blog URLs from planetrubyonrails.com! To observe the result, all we need is to run this snippet:

planet_ror_com_data.to_hash.each{|h| puts h[:link]}

The result is:

http://weblog.jamisbuck.org/
http://www.therailsway.com/
http://weblog.rubyonrails.com/

Looks good so far (although there is a nil value – we will deal with that later).
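
One simple way to get rid of such nil values is sketched below in plain Ruby. It assumes that to_hash returns an array of hashes keyed by the pattern name, as the snippet above suggests; the rows array is a hand-written stand-in so that the example runs without scRUBYt!.

```ruby
# Stand-in for planet_ror_com_data.to_hash (assumed shape: array of hashes).
rows = [
  { link: 'http://weblog.jamisbuck.org/' },
  { link: nil },
  { link: 'http://www.therailsway.com/' },
  { link: 'http://weblog.rubyonrails.com/' }
]

# Pull out the :link values and discard the nil entries.
links = rows.map { |h| h[:link] }.compact
puts links
```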

Let’s move on to http://planet.caboo.se/ which is just slightly harder to scrape:

    planet_caboose_data = Scrubyt::Extractor.define do
      fetch("http://planet.caboo.se/")

      link_row do
        link "al3x/@href"
      end
    end

The situation changed a bit: We need to match the row containing the blog address and feed address first, then scrape the “href” attribute of the blog URL link. Note how this is done – by appending “/@href” to the example string.

The next site is http://rubycorner.com/. The new thing here compared to the first two scrapers is that the blog URLs can be found on different pages – ergo we need to crawl to all of those pages. Fortunately this is no problem for scRUBYt!:

    rubycorner_data = Scrubyt::Extractor.define do
      fetch 'http://rubycorner.com/blogs/lang/en'

      link "Atlantic Dominion Solutions/@href"
      next_page ">"
    end

next_page is a special pattern which instructs scRUBYt! to crawl to next pages (the example “>” is used to identify the next page link).

Last but not least, here is the scraping code for http://www.planetrubyonrails.org/

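    # (opening line restored; the extractor name is assumed)
    planet_ror_org_data = Scrubyt::Extractor.define do
      fetch "http://www.planetrubyonrails.org/"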

      link_container "div[Feeds]" do
        link "a work on process/@href", :generalize => true
      end
    end

The link_container pattern instructs scRUBYt! to get a div containing the text “Feeds”, so that we scrape the links only inside the appropriate div (otherwise we would have some false positives). The :generalize => true option is used to match all the links, not just the first one.

I don’t have too much to say about the data cleaning part – if you are interested, you can download the code used for generating this tutorial and check it out there.
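
For the curious, here is a rough idea of what such a cleaning step could look like. This is only a sketch, not the code from the download: the normalize helper and the sample URLs are assumptions. It lowercases the host, drops a leading “www.” and trailing slashes, then removes duplicates. Paths such as “/blog” are kept, so deciding that http://site.com/blog and http://site.com are the same blog would still need a site-specific rule.

```ruby
require 'uri'

# Hypothetical helper: reduce a blog URL to a canonical form
# (lowercase host, no leading "www.", no trailing slash).
def normalize(url)
  uri  = URI.parse(url.strip)
  host = uri.host.downcase.sub(/\Awww\./, '')
  path = uri.path.sub(%r{/+\z}, '')
  "http://#{host}#{path}"
end

urls = [
  'http://www.site.com',
  'http://site.com/',
  'http://SITE.com',
  'http://site.com/blog/'
]

unique = urls.map { |u| normalize(u) }.uniq
puts unique
```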

Scraping alexa was great fun – mainly because they don’t want to be scraped! However, if they would like to stop a halfway serious script junkie, they will have to do better than this. Here is the alexa extractor:

  def scrape(site)
    @hidden = open("http://client.alexa.com/common/css/scramble.css").read.scan(/^.(.+) /).flatten unless @hidden
    alexa_data = Scrubyt::Extractor.define do
        fetch "http://www.alexa.com/data/details/traffic_details?url=#{site}"

        td "td[Traffic Rank for]" do
          span '/span[3]' do
            html :type => :html_subtree
          end
        end
    end
    data = alexa_data.to_hash[0][:html]
    data ? descramble(data) : nil
  end

  def descramble(mess)
    data = mess.gsub!(//,'').split(//)
    data = data.reject{|d| d=="" || (@hidden.include? d.scan(/"(.+?)"/).flatten[0])}
    data.map!{|d| d.sub(/".+">/,'') }
    data.join('')
  end

If the above code does not make too much sense to you, don’t worry – I will explain it in a follow-up post. What really matters is that it is possible (and relatively easy) to scrape alexa with scRUBYt! and Ruby.
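
In the meantime, here is a self-contained sketch of the general idea as I read it from the code above: the number is split into spans, some spans are decoys hidden via CSS classes, and the scraper has to throw the hidden ones away. The stylesheet, the class names and the markup below are all made up for illustration, so do not take them as the actual format used by alexa.

```ruby
# Hypothetical stylesheet: classes listed here are hidden from human readers.
css = <<~CSS
  .c1a {display:none}
  .x9f {display:none}
CSS

# Hypothetical scrambled markup: decoy digits sit in hidden spans.
scrambled = '<span class="b2c">59</span><span class="c1a">77</span>' \
            '<span class="d4e">,5</span><span class="x9f">3</span>' \
            '<span class="e5f">19</span>'

# Class names that start a line in the stylesheet, without the leading dot.
hidden = css.scan(/^\.(\S+) /).flatten

# Keep only the text of spans whose class is not hidden, then glue it together.
visible = scrambled.scan(/<span class="(.+?)">(.*?)<\/span>/)
                   .reject { |klass, _text| hidden.include?(klass) }
                   .map { |_klass, text| text }
                   .join
puts visible
```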

Here is the extractor I have used to scrape technorati:

  def scrape(site)
    techno_data = Scrubyt::Extractor.define do
      fetch "http://www.technorati.com/blogs/#{site}?reactions"
      rank "/html/body/div/div/div/div/div/div" do
        rank_number /Rank: (.+)$/
      end
    end
    techno_data.to_hash.reject{|h| h.empty?}[0][:rank_number]
  end

Nothing special here – what you have not seen yet in the other extractors is the use of a regular expression pattern (rank_number), which gets the number out of the corresponding element and leaves the trash behind.
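
The same regular expression can be tried on a plain string. The sketch below assumes the matched element contains some text along the lines of the sample (the sample text, including the “Authority” line, is invented); only the /Rank: (.+)$/ expression is taken from the extractor.

```ruby
# Hypothetical text content of the element matched by the 'rank' pattern.
element_text = "Authority: 512\nRank: 3,164"

# Capture everything after "Rank: " up to the end of that line.
rank_number = element_text[/Rank: (.+)$/, 1]
puts rank_number

# For sorting, drop the thousands separator and convert to an integer.
rank_value = rank_number.delete(',').to_i
puts rank_value
```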

The only manual step I had to make in this whole process was to tweak the alexa results a bit – because according to alexa.com, “Alexa’s traffic rankings are for top level domains only (e.g. domain.com). We do not provide separate rankings for subpages within a domain (e.g. www.domain.com/subpage.html) or subdomains (e.g. subdomain.domain.com)”. In practice this meant that I checked all blogger.com, wordpress.org, jroller.com and similar domains to see whether their concrete pages are really ranked that high, or just their top-level domain is. This blog also made it to the top 30, but I had to throw it away because about half of the traffic goes to the scRUBYt! forum, and the remaining visits were not enough to stay inside the top 30.

If you are still with me, I guess you had enough of scraping, screwing, cleaning and assembling by now – and your ordeal is over! Here are the results:

Ruby/Rails blog roundup based on alexa.com traffic values

1	http://redhanded.hobix.com	(59,519)
2	http://www.rubyinside.com	(59,729)
3	http://hivelogic.com	(73,585)
4	http://www.loudthinking.com	(77,521)
5	http://mephistoblog.com	(81,383)
6	http://weblog.techno-weenie.net	(93,041)
7	http://slash7.com	(98,732)
8	http://nubyonrails.com	(99,955)
9	http://errtheblog.com	(105,525)
10	http://rubyonrailsblog.com	(117,842)
11	http://pjhyett.com	(118,985)
12	http://www.railsenvy.com	(123,414)
13	http://encytemedia.com	(127,451)
14	http://www.igvita.com/blog	(131,955)
15	http://blog.zenspider.com	(133,281)
16	http://juixe.com/techknow	(134,454)
17	http://peepcode.com	(136,202)
18	http://weblog.jamisbuck.org	(136,606)
19	http://project.ioni.st	(149,352)
20	http://railscasts.com	(154,733)
21	http://metaatem.net	(155,901)
22	http://www.danwebb.net	(162,936)
23	http://www.robbyonrails.com	(181,590)
24	http://www.urbanhonking.com/ideasfordozens	(184,096)
25	http://blog.codahale.com	(188,987)
26	http://drnicwilliams.com	(197,798)
27	http://topfunky.com	(199,064)
28	http://blog.innerewut.de	(200,308)
29	http://web2withrubyonrails.gauldong.net	(203,938)
30	http://brainspl.at	(209,069)

Ruby/Rails blog roundup based on technorati.com traffic values

1	http://www.loudthinking.com	(3,164)
2	http://metaatem.net	(5,173)
3	http://mephistoblog.com	(6,223)
4	http://hivelogic.com	(6,498)
5	http://www.rubyinside.com	(7,899)
6	http://errtheblog.com	(8,587)
7	http://nubyonrails.com	(10,277)
8	http://quotedprintable.com	(10,680)
9	http://slash7.com	(11,154)
10	http://redhanded.hobix.com	(11,154)
11	http://project.ioni.st	(11,957)
12	http://weblog.jamisbuck.org	(12,412)
13	http://blog.leetsoft.com	(13,270)
14	http://weblog.rubyonrails.com	(13,445)
15	http://www.danwebb.net	(16,844)
16	http://www.igvita.com/blog	(17,262)
17	http://www.chadfowler.com	(17,262)
18	http://blog.codahale.com	(17,601)
19	http://clarkware.com/cgi/blosxom	(22,954)
20	http://drnicwilliams.com	(23,098)
21	http://www.oreillynet.com/ruby/blog	(23,346)
22	http://peepcode.com	(24,393)
23	http://eigenclass.org	(24,393)
24	http://mir.aculo.us	(25,071)
25	http://on-ruby.blogspot.com	(26,808)
26	http://antoniocangiano.com	(29,929)
27	http://www.therailsway.com	(30,395)
28	http://www.softiesonrails.com	(31,790)
29	http://glu.ttono.us	(33,550)
30	http://www.ryandaigle.com	(34,998)

There were a few blogs which performed equally well on both lists – they were assembled into a top 10 list, which you can check out here.
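
How exactly the two lists are merged is not described here, so take the following only as one plausible way to do it, not as the method actually used: keep the blogs that appear in both lists and order them by the sum of their positions. The sketch uses just the first five entries of each list above, so its output is not the real top 10.

```ruby
# First five entries of each ranking above (position = index + 1).
alexa = %w[
  http://redhanded.hobix.com http://www.rubyinside.com http://hivelogic.com
  http://www.loudthinking.com http://mephistoblog.com
]
technorati = %w[
  http://www.loudthinking.com http://metaatem.net http://mephistoblog.com
  http://hivelogic.com http://www.rubyinside.com
]

# Assumed scoring: sum of the positions in both lists, lower is better.
# Ties are broken alphabetically to keep the output deterministic.
combined = (alexa & technorati).map do |blog|
  [blog, alexa.index(blog) + technorati.index(blog) + 2]
end
top = combined.sort_by { |blog, score| [score, blog] }.first(10)
top.each { |blog, score| puts "#{score}\t#{blog}" }
```

A blog that shows up in only one of the lists gets no score at all in this scheme, which is a deliberate simplification.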

I would really like to hear your opinion on this little experiment – whether you think it makes sense or it is completely off, how it could be improved in the future, what features could be added etc. If I receive some positive feedback, I think I will work on the algorithm a bit more, and run it once every 3 months or so to see what’s happening around the Ruby/Rails blogosphere. Let me know what you think!

scRUBYt! wiki

Posted by admin

We have decided it’s time to start a scRUBYt! wiki (thanks to Whitefolkz for coming up with the idea and starting to work on it). You can find the wiki here. Whitefolkz already posted some guidelines and additional info.

Please post your experience, extractors, guides, tutorials or any other nuggets of knowledge you think the others can benefit from! Thanks for your help in advance.

Posted by admin on Jun 02 2007 under News & Announcements | 2 Comments » |

scRUBYt! 0.3.0 released

Posted by admin

Thanks to the scRUBYt! forum, the team has received a *lot* of feedback – we have tried to incorporate the features requested by the most people, and also fixed a few annoying bugs. Well, I guess no one is interested in this introductory babbling, so here you go: the CHANGELOG for 0.3.0:

[NEW] complete rewrite of the output system, creating a solid
           foundation for more robust output functions
           (credit: Neelance)
[NEW] logging - no annoying puts messages anymore!
           (credit: Tim Fletcher)
[NEW] can index an example - e.g.
           link "more[5]"
           semantics: give me the 6th element with the text "more"
[NEW] can use XPath checking an attribute value, like
            "//div[@id='content']"
[NEW] default values for missing elements (first version was
            done in 0.2.8 but it did not work for all cases)
[NEW] possibility to click a button by its text (instead of its index)
           (credit: Nick Merwin)
[NEW] clicking radio buttons
[NEW] can click on image buttons (by specifying the name of the
           button)
[NEW] possibility to extract a URL in one step, like so:
           link "The Difference/@href"
           i.e. give me the href attribute of the element matched by the
           example 'The Difference'
[NEW] new way to match an element of the page:
           content "div[The Difference]"
           means 'return the div which contains the string "The Difference"'.
           This is useful if the XPath of the element is non-constant across
           the same site (e.g. sometimes a banner or ad is added, sometimes
           not etc.)
[NEW] Clicking image maps; At the moment this is achieved by specifying
           an index, like
           click_image_map 3
          which means click the 4th link in the image map
[FIX] Replacing \240 (&nbsp;) with space in the preprocessing phase
         automatically
[FIX] Fixed: correctly downloading image if the src attribute had a leading
         space, as in