scRUBYt!
WWW::Mechanize and Hpricot on Steroids

Briefly...
scRUBYt! is a simple to learn and use, yet powerful web scraping toolkit written in Ruby. The idea behind making scRUBYt! was to show a few simple concepts of Web extraction as a practical extension of this tutorial.



August 16th, 2007 at 7:28 am
XPather? DOM Inspector? Sounds like you need to be introduced to Firebug
Looks good though — bookmarked!
August 17th, 2007 at 1:23 am
Tony,
I use Firebug a lot during Rails development, but I could not find the same functionality in it (at least not in the same way) as in XPather. Maybe it’s there and I just did not find it.
What I would want:
- to be able to click on an element in the browser and get its XPath
- to be able to enter an XPath and get the result nodes, with the possibility to blink them in the browser
Firebug seems to be too heavyweight for these. I guess it can do the above tasks, just in a different way.
So, if someone points out how to do the above things, I may switch completely to Firebug in the future (also for the tutorials).
August 19th, 2007 at 7:38 am
This is great stuff. It’ll be a good intro to scRubyt’s power.
With regard to using Firebug: if you open Firebug and click the “Inspect” button (top left), you can move around the page and it’ll highlight in red the parts you go over. If you click on something, it ‘locks’ onto it and shows you the info in the Firebug window underneath. The Inspect feature is very similar to XPather.
Another good feature of it is that it’ll also show you what styles are acting upon that node.
August 19th, 2007 at 11:48 am
As a guy that runs a website that includes a search engine with syndicated results (from Infospace), I just thought I’d point out that scraping Google results /probably/ has some sort of legal issues around it. Certainly if you do it in any commercial or large-scale capacity, but technically even in this scenario it might be an issue.
Having said that, this is a beautiful article about an awesome piece of code. Thanks!
August 19th, 2007 at 6:22 pm
What I would want:
- to be able to click on an element in the browser and get its XPath
1) Put your mouse over the element you want. Right-click and choose Inspect Element. The Firebug panel should open up with the element’s source highlighted.
2) Right-click the highlighted element (or whatever element you want) and choose ‘Copy XPath’.
- to be able to enter an XPath and get the result nodes
1) Go to the Firebug Console (CTRL+SHIFT+L) and enter $x('/your/xpath/selector')
Example:
$x('/html/body/div[2]/div/div[2]/div/div[2]/ol/li[2]')
or
$x('/html/body/div[2]/div/div[2]/div/div[2]/ol/li[2]/p')
The displayed results will be clickable.
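For anyone who would rather check an XPath from Ruby than from the Firebug console, roughly the same lookup can be done with REXML, which is distributed with Ruby. This is only a sketch: the page fragment, the XPath and the helper name xpath_texts are made up for illustration, and REXML needs well-formed markup, so real-world pages usually call for an HTML parser such as Hpricot instead.

```ruby
require 'rexml/document'

# Hypothetical helper: returns the text of every element that +xpath+
# matches in +markup+ (which must be well-formed XML/XHTML).
def xpath_texts(markup, xpath)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, xpath).map(&:text)
end

# Made-up sample page, used only to have something to query.
sample_page = <<~HTML
  <html>
    <body>
      <ol>
        <li><p>First result</p></li>
        <li><p>Second result</p></li>
      </ol>
    </body>
  </html>
HTML

# Rough Ruby counterpart of typing $x('/html/body/ol/li[2]/p') in Firebug.
puts xpath_texts(sample_page, '/html/body/ol/li[2]/p')
```

With this sample page the script prints the text of the second list item’s paragraph, "Second result".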
August 19th, 2007 at 6:26 pm
Sorry for the poorly formatted comment above.
There’s a great cheatsheet for Firebug at http://cheat.errtheblog.com/s/firebug/
August 29th, 2007 at 3:50 pm
What I don’t get is this: if I have to enter the first search result into my scRUBYt!-powered Ruby script in order to get Google search results, how exactly does that help me? If that search result doesn’t exist anymore, the script will fail. If I search for a term other than Ruby, the script will fail. And if I have to look at the search results in order to write the program that shows me those same results, writing the program isn’t really worth the trouble: I already saw the results when I visited the website a moment ago to find out what the first result would be.
Maybe it’s pretty obvious, but I just don’t get it. That’s why I still use Hpricot for scraping websites. I hope you can help me understand scRUBYt! better.
And please don’t be offended by what I said. I know that scRUBYt! can do these things; I just don’t understand how. From what I saw on the website, I really like where scRUBYt! is going. It could be worth as much to a Ruby programmer as a Swiss Army knife is to MacGyver, at least if you’re developing an application that gets some data from the net (and by today’s standards that means pretty much every app out there).
September 3rd, 2007 at 8:41 pm
scRUBYt! can take a script with hardcoded training data and convert it into a generalized script that uses XPath expressions and doesn’t depend on any specific text. (I was also confused on this point when I first read about scRUBYt!.)
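To make the idea concrete, here is a rough sketch of how an example text could be turned into a general XPath. It is not scRUBYt!’s actual implementation, only an illustration using REXML on a small, made-up, well-formed page. The helper names (find_by_text, indexed_path, generalize) and the choice to pass the repeating tag name explicitly are assumptions made for this example.

```ruby
require 'rexml/document'

# Hypothetical helper: first element whose own text equals +example+.
def find_by_text(doc, example)
  REXML::XPath.match(doc, '//*').find { |e| e.text == example }
end

# Hypothetical helper: absolute path of +element+ with positional indexes,
# e.g. /html/body/ol/li[1]/a.
def indexed_path(element)
  steps = []
  node = element
  while node.parent # the document node has no parent, so the walk stops there
    same_name = node.parent.elements.to_a.select { |e| e.name == node.name }
    step = node.name
    if same_name.size > 1
      position = same_name.index { |e| e.equal?(node) } + 1
      step += "[#{position}]"
    end
    steps.unshift(step)
    node = node.parent
  end
  '/' + steps.join('/')
end

# Hypothetical helper: drops the index from the first +record_tag+ step, so
# the path matches every such record and not only the example one.
def generalize(path, record_tag)
  path.sub(%r{/#{Regexp.escape(record_tag)}\[\d+\]}, "/#{record_tag}")
end

# Made-up result page; the links and titles are placeholders.
page = <<~HTML
  <html>
    <body>
      <ol>
        <li><a href="http://example.com/1">Ruby Programming Language</a></li>
        <li><a href="http://example.com/2">Ruby on Rails</a></li>
        <li><a href="http://example.com/3">Ruby (disambiguation)</a></li>
      </ol>
    </body>
  </html>
HTML

doc      = REXML::Document.new(page)
example  = find_by_text(doc, 'Ruby Programming Language')
specific = indexed_path(example)      # depends on the example's position
general  = generalize(specific, 'li') # no longer tied to the example

puts specific
puts general
puts REXML::XPath.match(doc, general).map(&:text)
```

On this sample the script prints /html/body/ol/li[1]/a, then /html/body/ol/li/a, then all three link texts. Once the general path has been derived, the example text is no longer needed, so the same path keeps working when the search term or the results change, as long as the page layout stays the same.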
September 25th, 2007 at 9:33 pm
Hi,
I’m a newbie and this might be a dumb question, but where do I run this from? I have created a project, i.e. rails mikesproject.
I have created a controller.
I have created a model.
I have put this script into the controller directory and tried to run it in Mongrel (I’m using Aptana), but it just brings up the script from your example on the screen, i.e. no listing like your example shows.
I’m lost.
April 2nd, 2008 at 7:38 am
Is the code HTML or what language? What do I need to execute it? A browser or a special compiler?
April 2nd, 2008 at 10:24 am
@Hoo Woo Chi
scRUBYt! was written in the Ruby language. To execute it you do not need a special compiler or a browser, only the Ruby interpreter. Run it like any other .rb file, for example by typing ruby followed by the file name on the command line.
April 2nd, 2008 at 12:36 pm
@m4r14nn4
Sooo… let’s suppose I have Windows XP and saved the code in TextPad as ruby.rb…
I know this sounds lame, but I know how to write and execute HTML and C++ code. For those you need either a browser or a compiler.
Am I just Ruby —> http://www.littlefriendsranch.com/RubyRAug1906%20071.jpg