scRUBYt!
WWW::Mechanize and Hpricot on Steroids

Briefly...
scRUBYt! is a simple to learn and use, yet powerful web scraping toolkit written in Ruby. The idea behind making scRUBYt! was to show a few simple concepts of Web extraction as a practical extension of this tutorial.

June 10th, 2007 at 10:45 pm
Subscriber counts would be an important metric of “importance” too. Do any of the large reader sites (Bloglines, Google Reader) publish the number of subscribers for their user base?
June 10th, 2007 at 10:46 pm
Perhaps you could publish an OPML file of these blogs’ feed URLs, for new people.
June 10th, 2007 at 11:27 pm
1): Great idea! I have been thinking about scraping FeedBurner stats, but not everybody is using FeedBurner, so I discarded this idea - but Bloglines/Google Reader might actually work! Going to check it out.
2): Yet another great idea - please don’t stop generating them :-). I’ll do this as soon as I have a little time.
June 13th, 2007 at 9:52 am
I’d definitely place subscriber numbers over both Alexa and Technorati. As to why: I have a blog that has virtually nothing on it, but it has an Alexa ranking of somewhere around 100k, and a Technorati rating of 12k. It also probably has 20 subscribers, if that! Now if it happened to have some posts on Rails (and was listed on any of the blog aggregators), it’d stand a reasonable chance of ending up near the top 10. It’s really easy to scam those sorts of numbers, whereas subscriber numbers don’t usually lie (or at least it takes a little more effort to scam them). This is the type of thing piss poor SEO types love to take advantage of. Don’t give them that chance.
June 14th, 2007 at 1:34 am
This is a great idea, especially if you’re able to add stats on subscribers.
How about publishing the top 100, to inspire those of us who blog rarely to try harder and climb the charts?
June 14th, 2007 at 1:56 am
@The Dude:
Well, such bogus sites
1) Don’t have Ruby/Rails posts
2) If they do, they don’t have good ones (maybe if they plagiarize) - but in either case, they won’t make it to the aggregators, or will be thrown out sooner or later.
While I agree with you that it is easy to scam Alexa and Technorati, I don’t think anyone wants to scam the Ruby/Rails aggregators (OK, if someone wants to, he can, but why would he?)
June 14th, 2007 at 2:01 am
@Old Dog:
Yeah, I am thinking about something like that. Most probably I will join the rubycorner.com guys and do something like this together with them.
June 29th, 2007 at 6:34 am
Doesn’t seem to be working anymore. I get this error message:
/usr/lib/ruby/gems/1.8/gems/scrubyt-0.3.0/lib/scrubyt/utils/xpathutils.rb:110:in `traverse_up_until_name': The element is nil! This probably means the widget with the specified name ('td') does not exist (RuntimeError)
from /usr/lib/ruby/gems/1.8/gems/scrubyt-0.3.0/lib/scrubyt/utils/simple_example_lookup.rb:32:in `find_node_from_text'
from /usr/lib/ruby/gems/1.8/gems/scrubyt-0.3.0/lib/scrubyt/core/scraping/filters/treefilter.rb:67:in `generate_XPath_for_example'
from /usr/lib/ruby/gems/1.8/gems/scrubyt-0.3.0/lib/scrubyt/core/scraping/pattern.rb:110:in `initialize'
from /usr/lib/ruby/gems/1.8/gems/scrubyt-0.3.0/lib/scrubyt/core/scraping/pattern.rb:109:in `initialize'
from /usr/lib/ruby/gems/1.8/gems/scrubyt-0.3.0/lib/scrubyt/core/shared/extractor.rb:115:in `method_missing'
from ./alexa_scraper.rb:12:in `scrape'
from /usr/lib/ruby/gems/1.8/gems/scrubyt-0.3.0/lib/scrubyt/core/shared/extractor.rb:32:in `define'
from ./alexa_scraper.rb:9:in `scrape'
from blogranker.rb:10
from blogranker.rb:9
August 30th, 2007 at 4:13 am
Hi,
I’ve been using this URL: http://data.alexa.com/data?cli=10&dat=s&url=
to scrape data from Alexa. Unlike the main site, it’s a feed, and the traffic rank is stored in the TEXT field of the POPULARITY tag (the one that also carries the URL).
esh
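For readers who want to try the feed approach described above, here is a minimal sketch of pulling the traffic rank out of such a response with Ruby's REXML. The sample XML is an assumption modeled on the comment's description (a POPULARITY tag carrying URL and TEXT); the real payload may be shaped differently. The `traffic_rank` helper is a hypothetical name, and fetching the feed is left out on purpose.

```ruby
require 'rexml/document'

# Hypothetical sample response; the real feed may contain more (or different) elements.
SAMPLE_FEED = <<-XML
<ALEXA VER="0.9" URL="example.com/">
  <SD>
    <POPULARITY URL="example.com/" TEXT="12345"/>
  </SD>
</ALEXA>
XML

# Returns the traffic rank as an Integer, or nil when the feed
# has no POPULARITY tag or the tag has no TEXT attribute.
def traffic_rank(xml)
  doc = REXML::Document.new(xml)
  node = REXML::XPath.first(doc, '//POPULARITY')
  return nil unless node

  text = node.attributes['TEXT']
  text ? Integer(text, 10) : nil
end

puts traffic_rank(SAMPLE_FEED)
```

Parsing a feed this way does not depend on the page layout, so it should be less brittle than locating a table cell on the main site by example text.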
August 30th, 2007 at 7:30 am
You can access the Alexa data more easily, without scraping, using the free AWIS4Ruby project:
http://labs.votanweb.com/awis4ruby/
September 3rd, 2007 at 12:58 am
Jay,
Thanks for the link.
It’s a well-known fact that in today’s era (web2.0, WS, RSS/ATOM, REST… etc.) the more modern sites usually provide other ways to access their data, usually cleaner and more effective than scraping. The aim of these tutorials is not to promote scraping over these techniques - it’s just that people are more interested in scraping Google, Alexa, Amazon etc. than some old Web1.0 sites, even though in real life probably 90% of your scrapers will target those…
September 22nd, 2007 at 4:24 am
[…] the site RubyRailWais listed the 10 most popular blogs of the moment, according to the Alexa and Technorati rankings. See the suggestions and update your […]