Archive for June, 2007

Ranking Ruby/Rails Blogs with scRUBYt! Using Alexa and Technorati

As a (positive) side effect of Ruby’s and RoR’s current popularity, we can witness the mushrooming of Ruby/Rails blogs of every size, shape, color and quality. While this is a great thing in general, it is easy to get flooded with all the information pouring in from every side - and much harder to pick a smaller set that you can follow in the time you devote to reading Ruby blogs.

Fortunately there are some great Ruby/Rails blog aggregators like http://rubycorner.com/, http://www.planetrubyonrails.org/ - not to be confused with http://www.planetrubyonrails.com/ (here is why) - or http://planet.caboo.se/. The problem is that it’s clumsy to monitor all these aggregators: each of them contains hundreds of blogs, with a lot of overlap.

This issue was partially solved by this Ruby on Rails Blog Aggregator Dupe Prevention Yahoo Pipe. However, there is a fundamental problem with that: you still have several hundred blogs - nearly 500 of them, with new ones coming out day by day. If you have time to read all of these regularly, more power to you - however, the rest of us need a way to pick the top n so that the reading stays manageable.

I am proposing a solution for this problem, by performing these steps:

  1. Collecting all the blog URLs from the above 4 aggregators
  2. Performing some data cleaning (e.g. http://www.site.com, http://site.com, http://site.com/blog and other variants, with and without trailing slashes are pointing to the same site) and dupe removal
  3. Querying http://www.alexa.com and http://www.technorati.com for the popularity of each of the blogs on the list (this already generates two rankings: “Alexa top 30” and “Technorati top 30”).
  4. Alexa required some intervention by hand (more on that later).
  5. Putting it all together: based on the two top 30 lists, generate a final, all-in-one Ruby/Rails blog top 10!

Let’s see an overview of the process - then I will explain each step in detail!

[Figure: full workflow]

Let’s begin with scraping http://www.planetrubyonrails.com/. This extractor was the easiest of the four to construct:

    require 'rubygems'
    require 'scrubyt'

    planet_ror_com_data = Scrubyt::Extractor.define do
      fetch("http://www.planetrubyonrails.com/pages/channels")

      # One example URL; the extractor returns all the matching blog links
      link "http://weblog.jamisbuck.org/"
    end

This wasn’t too hard, was it? We have just extracted all the blog URLs from planetrubyonrails.com! To observe the result, all we need to do is run this snippet:

    planet_ror_com_data.to_hash.each { |h| puts h[:link] }

The result is:

http://weblog.jamisbuck.org/
http://www.therailsway.com/
http://weblog.rubyonrails.com/
http://blogs.relevancellc.com/
http://www.anarchogeek.com/
http://www.bencurtis.com/
http://slash7.com/
http://blog.lathi.net/
http://convergentarts.com/
http://blog.innerewut.de/
http://codefluency.com/rails-views
http://www.busyashell.com/blog/
http://curthibbs.wordpress.com/
http://blog.bleything.net/
http://www.loudthinking.com/
http://railstips.org/
http://null.in/
http://www.jvoorhis.com/
http://www.puneruby.com/blog
http://sitekreator.com/satishtalim/rubyblog.html
http://www.rubyinside.com/
http://www.chadfowler.com/index.cgi
http://dibya.wordpress.com/
http://rubymerchant.blogspot.com/index.html
http://www.rubyonrailsblog.com/
http://blog.caboo.se/
http://www.ryandaigle.com/
http://vinsol.com/
http://jroller.com/page/obie
http://www.infoq.com/
http://drnicwilliams.com/
http://www.danwebb.net/
http://www.lukeredpath.co.uk/
http://blog.testingrails.com/
http://blog.vixiom.com/
http://robertrevans.com/
http://redhanded.hobix.com/
http://benmyles.com/
http://weblog.techno-weenie.net/
http://www.clarkware.com/cgi/blosxom
http://mir.aculo.us/
http://www.oreillynet.com/ruby/blog/
http://www.robbyonrails.com/
http://blog.grayproductions.net/
http://www.rubyquiz.com/
http://www.codyfauser.com/
http://railsexpress.de/blog/
http://www.softiesonrails.com/
http://blog.ericgoodwin.com/
http://iamrice.org/
http://blog.hasmanythrough.com/
http://david.goodlad.ca/
http://nubyonrails.topfunky.com/
http://onrails.org/
http://www.railtie.net/
http://brainspl.at/
http://podcast.rubyonrails.org/
http://blog.leetsoft.com/
http://errtheblog.com/
http://jystewart.net/process
http://www.pluginaweek.org/
http://mephistoblog.com/
http://encytemedia.com/
http://evil.che.lu/
http://technomancy.us/
http://poocs.net/
http://www.opensoul.org/
http://www.mathewabonyi.com/
http://www.railslivecd.org/
http://blog.methodmissing.com/
http://sporkmonger.com/breakpoint_client
http://typo.onxen.info/
http://blog.inquirylabs.com/
http://rashkovskii.com/
http://www.infused.org/
http://interblah.net/
http://streamlinedframework.com/
http://toolmantim.com/
http://webonrails.wordpress.com/
http://www.sapphiresteel.com/
http://eigenclass.org/hiki.rb
http://www.pjhyett.com/
http://jonathan.tron.name/
http://amazing-development.com/
http://merubyyoujane.com/
http://www.fallenrogue.com/
http://troubleseeker.com/
http://blog.nicksieger.com/
http://blog.methodmissing.com/
http://schf.uc.org/
http://www.brianketelsen.com/
http://www.simplisticcomplexity.com/
http://fluctisonous.com/
http://www.pinupgeek.com/
http://web2withrubyonrails.gauldong.net/
http://www.koziarski.net/
http://www.notsostupid.com/
http://www.shanesbrain.net/
http://blog.evanweaver.com/
http://www.zerosum.org/devblog
http://blog.zenspider.com/
http://townx.org/taxonomy/term/5/0
http://www.fromdelhi.com/
http://myles.eftos.id.au/blog
http://www.jeremyhubert.com/
http://www.jason-palmer.com/
http://seanicus.blogspot.com/
http://ajaxonrails.wordpress.com/
nil
http://www.matthewbass.com/blog
http://rails.co.za/
http://www.shifteleven.com/
http://peepcode.com/
http://blog.wolfman.com/
http://blog.dannyburkes.com/
http://jayfields.blogspot.com/
http://www.freeonrails.com/
http://techcheatsheets.com/tag/ruby%20or%20rails
http://rails-engines.org/rss/news/
http://www.railsenvy.com/
http://www.eribium.org/
http://hobocentral.net/blog
http://maintainable.com/articles
http://railspikes.com/
http://radiantcms.org/
http://hackety.org/
http://adam.blogs.bitscribe.net/
http://www.rubyfleebie.com/
http://www.stephenbartholomew.co.uk/
http://izumi.plan99.net/blog
http://blog.davidchelimsky.net
Looks good so far (although there is a nil value - we will deal with that later).
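
As a minimal sketch of how the nil entry (and the repeated http://blog.methodmissing.com/ line) could be dropped, assuming the extractor output is an array of hashes keyed by :link as in the snippet above. The helper name extract_links and the sample rows are hypothetical, and this is not necessarily how the original cleaning code works:

```ruby
# Hypothetical helper: pull the :link values out of the rows returned by
# to_hash, dropping missing (nil) entries and exact duplicates.
def extract_links(rows)
  rows.map { |row| row[:link] }.compact.uniq
end

# Sample rows shaped like the extractor output shown above
sample_rows = [
  { link: 'http://weblog.jamisbuck.org/' },
  { link: nil },
  { link: 'http://blog.methodmissing.com/' },
  { link: 'http://blog.methodmissing.com/' }
]

puts extract_links(sample_rows)
```

Exact-match deduplication like this only catches identical strings; variants such as a missing trailing slash need real normalization.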

Let’s move on to http://planet.caboo.se/ which is just slightly harder to scrape:

    planet_caboose_data = Scrubyt::Extractor.define do
      fetch("http://planet.caboo.se/")

      link_row do
        link "al3x/@href"
      end
    end

The situation has changed a bit: we need to match the row containing the blog address and the feed address first, then scrape the “href” attribute of the blog URL link. Note how this is done - by appending “/@href” to the example string.

The next site is http://rubycorner.com/. The new thing here compared to the first two scrapers is that the blog URLs can be found on different pages - ergo we need to crawl to all of those pages. Fortunately this is no problem for scRUBYt!:

    rubycorner_data = Scrubyt::Extractor.define do
      fetch 'http://rubycorner.com/blogs/lang/en'

      link "Atlantic Dominion Solutions/@href"
      next_page ">"
    end

next_page is a special pattern which instructs scRUBYt! to crawl to the following pages (the example “>” is used to identify the next page link).
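
To illustrate what “crawling to the next pages” means in general, here is a plain Ruby sketch that follows a chain of next-page links and collects the links found on every page. It does not use scRUBYt! and says nothing about how scRUBYt! implements next_page; the in-memory PAGES hash, the example.com URLs and the crawl_all_pages helper are all made up for the illustration:

```ruby
# Hypothetical in-memory "site": each page lists some blog links and
# names the page its ">" link points to (nil on the last page).
PAGES = {
  'page1' => { links: ['http://a.example.com/', 'http://b.example.com/'], next: 'page2' },
  'page2' => { links: ['http://c.example.com/'], next: 'page3' },
  'page3' => { links: ['http://d.example.com/'], next: nil }
}.freeze

# Follow the next-page chain from the start page, gathering links as we go.
# The seen hash guards against a chain that loops back on itself.
def crawl_all_pages(pages, start)
  links = []
  seen = {}
  current = start
  while current && !seen[current]
    seen[current] = true
    page = pages.fetch(current)
    links.concat(page[:links])
    current = page[:next]
  end
  links
end

puts crawl_all_pages(PAGES, 'page1')
```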

Last but not least, here is the scraping code for http://www.planetrubyonrails.org/:

    # planet_ror_org_data is an assumed name, following the other extractors
    planet_ror_org_data = Scrubyt::Extractor.define do
      fetch "http://www.planetrubyonrails.org/"

      link_container "div[Feeds]" do
        link "a work on process/@href", :generalize => true
      end
    end

The link_container pattern instructs scRUBYt! to get a div containing the text “Feeds”, so that we scrape the links only inside the appropriate div (otherwise we would have some false positives). The :generalize => true option is used to match all the links, not just the first one.

I don’t have too much to say about the data cleaning part - if you are interested, you can download the code used for generating this tutorial and check it out there.
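
For readers who just want the idea, here is a rough sketch of the kind of normalization described in step 2, assuming it is enough to lowercase the host, drop a leading “www.” and strip trailing slashes. The helper names normalize_blog_url and clean_blog_list are hypothetical, and this is not the code from the download:

```ruby
require 'uri'

# Hypothetical helper: reduce a blog URL to a canonical form so that
# http://www.site.com, http://site.com and http://site.com/ compare equal.
# Returns nil for missing or unparsable input.
def normalize_blog_url(url)
  return nil if url.nil? || url.strip.empty?
  uri = URI.parse(url.strip)
  return nil unless uri.host
  host = uri.host.downcase.sub(/\Awww\./, '')
  path = uri.path.to_s.sub(%r{/+\z}, '')
  "http://#{host}#{path}"
rescue URI::InvalidURIError
  nil
end

# Normalize every URL, then drop the nils and the duplicates.
def clean_blog_list(urls)
  urls.map { |url| normalize_blog_url(url) }.compact.uniq
end

puts clean_blog_list(['http://www.site.com', 'http://site.com/', nil, 'http://site.com/blog/'])
```

This sketch deliberately keeps http://site.com/blog separate from http://site.com; deciding whether a path like /blog is the same site as the root probably needs a per-site judgment call.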

Scraping Alexa was great fun - mainly because they don’t want to be scraped! However, if they want to stop a halfway serious script junkie, they will have to do better than this. Here is the Alexa extractor:

    require 'open-uri'

    def scrape(site)
      # Cache the class names listed in scramble.css; spans with these classes are discarded
      # (on newer Rubies, use URI.open instead of open)
      @hidden ||= open("http://client.alexa.com/common/css/scramble.css").read.scan(/^\.(.+) /).flatten
      alexa_data = Scrubyt::Extractor.define do
        fetch "http://www.alexa.com/data/details/traffic_details?url=#{site}"

        td "td[Traffic Rank for]" do
          span '/span[3]' do
            html :type => :html_subtree
          end
        end
      end
      data = alexa_data.to_hash[0][:html]
      data ? descramble(data) : nil
    end

    def descramble(mess)
      # gsub rather than gsub!, because gsub! returns nil when nothing was replaced
      data = mess.gsub(/<!--.+-->/, '').split(/<span class=(.+?)<\/span>/)
      # Drop the empty pieces and the spans whose class is in the hidden list
      data = data.reject { |d| d == "" || @hidden.include?(d.scan(/"(.+?)"/).flatten[0]) }
      # Strip the leftover '"class">' prefix from the spans that remain
      data.map! { |d| d.sub(/".+">/, '') }
      data.join
    end

If the above code does not make too much sense to you, don’t worry - I will explain it in a follow-up post. What really matters is that it is possible (and relatively easy) to scrape alexa with scRUBYt! and Ruby.
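
In the meantime, here is a standalone sketch of the descrambling idea that runs without scRUBYt! or a network connection. It mirrors the regular expressions of the extractor above, but the CSS and the markup are invented samples (the real Alexa page is not reproduced here), and the hidden class list is passed in as a parameter instead of being kept in @hidden:

```ruby
# Collect the class names declared in the stylesheet (same regexp as above).
def hidden_classes(css)
  css.scan(/^\.(.+) /).flatten
end

# Remove HTML comments, split the markup on <span class=...>...</span>
# (the capture group keeps each span's '"class">text' part), throw away
# the spans with a hidden class, and glue the visible text back together.
def descramble(markup, hidden)
  parts = markup.gsub(/<!--.+-->/, '').split(/<span class=(.+?)<\/span>/)
  parts = parts.reject { |part| part == '' || hidden.include?(part[/"(.+?)"/, 1]) }
  parts.map { |part| part.sub(/".+">/, '') }.join
end

# Invented sample data: class "a1" is hidden, class "b2" is visible
sample_css = ".a1 {display:none}\n.zz {display:none}\n"
sample_markup = '<!-- filler -->5<span class="a1">7</span><span class="b2">9</span>,519'

puts descramble(sample_markup, hidden_classes(sample_css))
```

With this sample the decoy digit 7 sits in a span with a hidden class, so it is dropped, while the 9 in the visible span survives.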

Here is the extractor I used to scrape Technorati:

    def scrape(site)
      techno_data = Scrubyt::Extractor.define do
        fetch "http://www.technorati.com/blogs/#{site}?reactions"
        rank "/html/body/div/div/div/div/div/div" do
          rank_number(/Rank: (.+)$/)
        end
      end
      techno_data.to_hash.reject { |h| h.empty? }[0][:rank_number]
    end

Nothing special here - the one thing you have not seen yet in the other extractors is the use of a regular expression pattern (rank_number), which gets the number out of the corresponding element and leaves the trash behind.
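
To show what that regular expression captures, here is a small plain Ruby sketch. The sample text is an assumption about what the element might contain (the real Technorati markup is not shown in this post), and extract_rank is a hypothetical helper that also turns the captured string into an integer:

```ruby
# Hypothetical helper: apply the same pattern as the rank_number line
# above, then drop the thousands separator and convert to an integer.
def extract_rank(text)
  match = text.match(/Rank: (.+)$/)
  return nil unless match
  match[1].delete(',').to_i
end

# Assumed sample of the text found in the matched element
sample_text = "Some blog title\nRank: 3,164\nmore text"

puts extract_rank(sample_text)
```

Converting to an integer is what makes the ranks sortable later; as strings, “11,154” would sort before “3,164”.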

The only manual step in the whole process was tweaking the Alexa results a bit, because according to alexa.com, “Alexa’s traffic rankings are for top level domains only (e.g. domain.com). We do not provide separate rankings for subpages within a domain (e.g. www.domain.com/subpage.html) or subdomains (e.g. subdomain.domain.com)”. In practice this meant that I checked every blog hosted on blogger.com, wordpress.org, jroller.com and similar domains to see whether the concrete page is really ranked that high, or only its top-level domain is. This blog also made it to the top 30, but I had to throw it out because about half of its traffic goes to the scRUBYt! forum, and the remaining visits were not enough to stay inside the top 30.
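
The hand check itself cannot be automated, but picking the candidates for it can. Below is a sketch that flags URLs which sit on a shared blog host or point at a subpage. The SHARED_HOSTS list is an assumption based on the hosts that show up in the scraped output, and needs_manual_check? is a hypothetical helper, not part of the original code:

```ruby
require 'uri'

# Assumed, non-exhaustive list of shared blog hosts
SHARED_HOSTS = %w[blogspot.com wordpress.com jroller.com].freeze

# True when the Alexa rank of the URL probably describes the hosting
# domain rather than the blog itself: either the host is (a subdomain
# of) a shared blog host, or the blog lives under a path such as /blog.
def needs_manual_check?(url)
  uri = URI.parse(url)
  host = uri.host.to_s.downcase
  shared = SHARED_HOSTS.any? { |domain| host == domain || host.end_with?(".#{domain}") }
  subpage = !uri.path.to_s.sub(%r{/+\z}, '').empty?
  shared || subpage
end

%w[
  http://jayfields.blogspot.com/
  http://jroller.com/page/obie
  http://www.loudthinking.com/
].each do |url|
  puts "#{url} -> #{needs_manual_check?(url) ? 'check by hand' : 'ok'}"
end
```

Ordinary subdomains such as weblog.example.com are not flagged here, because telling them apart from registered domains reliably would need a list of public suffixes.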

If you are still with me, I guess you have had enough of scraping, screwing, cleaning and assembling by now - and your ordeal is over! Here are the results:

Ruby/Rails blog roundup based on alexa.com traffic values

1 http://redhanded.hobix.com (59,519)
2 http://www.rubyinside.com (59,729)
3 http://hivelogic.com (73,585)
4 http://www.loudthinking.com (77,521)
5 http://mephistoblog.com (81,383)
6 http://weblog.techno-weenie.net (93,041)
7 http://slash7.com (98,732)
8 http://nubyonrails.com (99,955)
9 http://errtheblog.com (105,525)
10 http://rubyonrailsblog.com (117,842)
11 http://pjhyett.com (118,985)
12 http://www.railsenvy.com (123,414)
13 http://encytemedia.com (127,451)
14 http://www.igvita.com/blog (131,955)
15 http://blog.zenspider.com (133,281)
16 http://juixe.com/techknow (134,454)
17 http://peepcode.com (136,202)
18 http://weblog.jamisbuck.org (136,606)
19 http://project.ioni.st (149,352)
20 http://railscasts.com (154,733)
21 http://metaatem.net (155,901)
22 http://www.danwebb.net (162,936)
23 http://www.robbyonrails.com (181,590)
24 http://www.urbanhonking.com/ideasfordozens (184,096)
25 http://blog.codahale.com (188,987)
26 http://drnicwilliams.com (197,798)
27 http://topfunky.com (199,064)
28 http://blog.innerewut.de (200,308)
29 http://web2withrubyonrails.gauldong.net (203,938)
30 http://brainspl.at (209,069)

Ruby/Rails blog roundup based on technorati.com rank values

1 http://www.loudthinking.com (3,164)
2 http://metaatem.net (5,173)
3 http://mephistoblog.com (6,223)
4 http://hivelogic.com (6,498)
5 http://www.rubyinside.com (7,899)
6 http://errtheblog.com (8,587)
7 http://nubyonrails.com (10,277)
8 http://quotedprintable.com (10,680)
9 http://slash7.com (11,154)
10 http://redhanded.hobix.com (11,154)
11 http://project.ioni.st (11,957)
12 http://weblog.jamisbuck.org (12,412)
13 http://blog.leetsoft.com (13,270)
14 http://weblog.rubyonrails.com (13,445)
15 http://www.danwebb.net (16,844)
16 http://www.igvita.com/blog (17,262)
17 http://www.chadfowler.com (17,262)
18 http://blog.codahale.com (17,601)
19 http://clarkware.com/cgi/blosxom (22,954)
20 http://drnicwilliams.com (23,098)
21 http://www.oreillynet.com/ruby/blog (23,346)
22 http://peepcode.com (24,393)
23 http://eigenclass.org (24,393)
24 http://mir.aculo.us (25,071)
25 http://on-ruby.blogspot.com (26,808)
26 http://antoniocangiano.com (29,929)
27 http://www.therailsway.com (30,395)
28 http://www.softiesonrails.com (31,790)
29 http://glu.ttono.us (33,550)
30 http://www.ryandaigle.com (34,998)

There were a few blogs which performed equally well on both lists - they were assembled into a top 10 list, which you can check out here.
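
One simple way to merge two such lists, shown here only as a sketch: keep the blogs that appear on both lists and order them by the sum of their two positions. This rule is an assumption made for illustration, not necessarily the method behind the linked top 10, and combine_rankings is a hypothetical helper. The sample input is just the first five entries of each list above, so the output is not the real top 10:

```ruby
# Hypothetical helper: both arguments are arrays of URLs ordered from
# best to worst. Returns the URLs present on both lists, sorted by the
# sum of their positions (ties are broken alphabetically).
def combine_rankings(list_a, list_b, top_n = 10)
  position_a = {}
  list_a.each_with_index { |url, i| position_a[url] ||= i + 1 }
  position_b = {}
  list_b.each_with_index { |url, i| position_b[url] ||= i + 1 }

  common = position_a.keys & position_b.keys
  common.sort_by { |url| [position_a[url] + position_b[url], url] }.first(top_n)
end

# First five entries of each list, for illustration only
alexa_top = %w[
  http://redhanded.hobix.com http://www.rubyinside.com http://hivelogic.com
  http://www.loudthinking.com http://mephistoblog.com
]
technorati_top = %w[
  http://www.loudthinking.com http://metaatem.net http://mephistoblog.com
  http://hivelogic.com http://www.rubyinside.com
]

puts combine_rankings(alexa_top, technorati_top)
```

Summing positions rather than raw rank values keeps the two sources on the same scale, since Alexa and Technorati numbers are not directly comparable.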

I would really like to hear your opinion on this little experiment - whether you think it makes sense or is completely off, how it could be improved in the future, what features could be added etc. If I receive some positive feedback, I think I will work on the algorithm a bit more and run it, say, once every 3 months to see what’s happening around the Ruby/Rails blogosphere. Let me know what you think!


scRUBYt! wiki

We have decided it’s time to start a scRUBYt! wiki (thanks to Whitefolkz for coming up with the idea and starting to work on it). You can find the wiki here. Whitefolkz already posted some guidelines and additional info.

Please post your experience, extractors, guides, tutorials or any other nuggets of knowledge you think the others can benefit from! Thanks for your help in advance.