scRUBYt!
WWW::Mechanize and Hpricot on Steroids

Briefly...
scRUBYt! is a simple to learn and use, yet powerful web scraping toolkit written in Ruby. The idea behind making scRUBYt! was to show a few simple concepts of Web extraction as a practical extension of this tutorial.
February 21st, 2007 at 5:21 am
What can I say - My site seems to work now! (only did a quick check with my old script, but it looks promising).
Great project! All power to you.
February 21st, 2007 at 6:23 am
Do you have an example with a proxy setting?
I tried:
fetch url, proxy
where proxy='http://proxy.name:port'
but I get the same error:
c:/ruby184/lib/ruby/gems/1.8/gems/scrubyt-0.2.3/lib/scrubyt/core/navigation/fetchaction.rb:109:in `parse_and_set_proxy': undefined method `downcase' for nil:NilClass (NoMethodError)
    from c:/ruby184/lib/ruby/gems/1.8/gems/scrubyt-0.2.3/lib/scrubyt/core/navigation/fetchaction.rb:26:in `fetch'
    from c:/ruby184/lib/ruby/gems/1.8/gems/scrubyt-0.2.3/lib/scrubyt/core/navigation/navigation_actions.rb:57:in `fetch'
    from c:/ruby184/lib/ruby/gems/1.8/gems/scrubyt-0.2.3/lib/scrubyt/core/shared/extractor.rb:47:in `method_missing'
    from test.rb:22
    from c:/ruby184/lib/ruby/gems/1.8/gems/scrubyt-0.2.3/lib/scrubyt/core/shared/extractor.rb:27:in `define'
    from test.rb:21
shell returned 1
If you could post an example with an HTTP proxy, it would help.
thanks
chris
February 21st, 2007 at 6:40 am
Chris,
It should be
Please report back if it works. I have only done a dummy test so far, so I am curious whether it works with a ‘real’ proxy… Thanks!
February 21st, 2007 at 7:25 am
Yes, it’s working. Just one note:
the value of :proxy => must be 'my.pro.xy:8080'
(not 'http://my.pro.xy:8080').
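For anyone who keeps proxy settings as full URLs, here is a minimal sketch of converting them to the bare host:port form reported to work above. The helper name `normalize_proxy` and the fallback port of 8080 are assumptions made for illustration; this is plain Ruby, not part of scRUBYt! itself.

```ruby
# Hypothetical helper (not part of scRUBYt!): reduce a proxy setting to the
# bare 'host:port' form. The default port of 8080 is an arbitrary assumption.
def normalize_proxy(setting, default_port = 8080)
  raise ArgumentError, 'proxy setting is empty' if setting.nil? || setting.strip.empty?

  # Drop a leading scheme such as 'http://' and a trailing slash, if present.
  bare = setting.strip.sub(%r{\A[a-z][a-z0-9+.-]*://}i, '').chomp('/')

  host, port = bare.split(':', 2)
  port = default_port if port.nil? || port.empty?
  "#{host}:#{port}"
end

puts normalize_proxy('http://my.pro.xy:8080')
puts normalize_proxy('my.pro.xy')
```

The result could then be passed as the :proxy value, assuming the option behaves as described in the comment above.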
When I run the example misc/google.rb
I get the following error (see below).
When I comment out the line:
next_page "Next", :limit => 2
it works.
Thanks again for the nice tool.
[MODE] Learning
[ACTION] Setting proxy: host=, port=<80>
[ACTION] fetching document: http://www.google.com/ncr
[ACTION] typing ruby into the textfield named ‘q’
[ACTION] submitting form…
[ACTION] fetched http://www.google.com/search?hl=en&ie=ISO-8859-1&q=ruby
c:/ruby184/lib/ruby/gems/1.8/gems/scrubyt-0.2.3/lib/scrubyt/core/shared/evaluationcontext.rb:73:in `generate_next_page_link': private method `gsub' called for nil:NilClass (NoMethodError)
    from c:/ruby184/lib/ruby/gems/1.8/gems/scrubyt-0.2.3/lib/scrubyt/core/shared/evaluationcontext.rb:31:in `crawl_to_new_page'
    from c:/ruby184/lib/ruby/gems/1.8/gems/scrubyt-0.2.3/lib/scrubyt/core/shared/extractor.rb:103:in `evaluate_extractor'
    from c:/ruby184/lib/ruby/gems/1.8/gems/scrubyt-0.2.3/lib/scrubyt/core/shared/extractor.rb:101:in `evaluate_extractor'
    from c:/ruby184/lib/ruby/gems/1.8/gems/scrubyt-0.2.3/lib/scrubyt/core/shared/extractor.rb:36:in `define'
    from test.rb:23
shell returned 1
February 23rd, 2007 at 6:37 am
again : thanks !
February 24th, 2007 at 9:58 pm
Since the tutorial for constraints is not yet written, I thought I’d post here.
Is it possible to ensure that a certain pattern has a certain value? For example, if I only want to extract the rows in a table with ‘Book’ as the value of the first column.
Also, it seems impossible to write custom Ruby code inside the extractor. Is there any easy way to do so, in order to include extra computed information in the XML file (without generating the XML, then reparsing it using REXML, and rebuilding an XML file…)? For example, let’s say I extract addresses, but I want to run them through my geocoder and include the coordinates in the XML.
February 26th, 2007 at 2:19 am
I have just added the tutorial; I just have not had time to announce it yet.
To answer your question, it is not yet possible since there are just some basic constraints ATM. However, it would be dead easy to add something like this, so I will do it when I have a little time.
Your second question: yeah, working on that. The stuff you require is called a ‘script pattern’, which executes a chunk of Ruby code on the input of the pattern, and can then have child patterns like any other normal pattern.
If you have time, you could add these to the RubyForge tracker to make sure I don’t lose track of them…
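Until such constraints and script patterns are available, the two requests can be approximated by post-processing extracted records in plain Ruby. The sketch below is only an illustration: the `rows` data, the `select_and_enrich` helper and the stub geocoder are all hypothetical, and none of it is scRUBYt! API.

```ruby
# Hypothetical records, standing in for whatever an extractor produced.
rows = [
  { type: 'Book', title: 'First title',  address: '1 Example Street' },
  { type: 'DVD',  title: 'Second title', address: '2 Example Street' },
  { type: 'Book', title: 'Third title',  address: '3 Example Street' }
]

# Stand-in for a real geocoder (assumption): always returns the same point.
fake_geocoder = ->(_address) { { lat: 0.0, lng: 0.0 } }

# Keep only the rows whose `column` equals `wanted`, then attach the value
# computed by the given block under the :coordinates key.
def select_and_enrich(rows, column, wanted)
  rows.select { |row| row[column] == wanted }
      .map    { |row| row.merge(coordinates: yield(row)) }
end

books = select_and_enrich(rows, :type, 'Book') do |row|
  fake_geocoder.call(row[:address])
end

books.each { |book| puts "#{book[:title]}: #{book[:coordinates].inspect}" }
```

The first step plays the role of the value constraint and the block plays the role of the custom code; a real geocoder would replace the stub.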
February 26th, 2007 at 7:22 am
I’ve just added them to the feature request tracker.
http://rubyforge.org/tracker/index.php?group_id=2836&atid=10927
I seriously love this project, I can’t wait until it has more options! Right now I’m writing extractors for just about anything, because it’s simply fun.
I posted a comment rather than a submission before, because I wasn’t sure if it was already implemented or not.
Keep up the great work !
February 26th, 2007 at 7:35 am
Thanks, skwid!
I am just setting up a forum as a better means of communication (i.e. better than WP comments, PMs etc.), and when I have a bit of time I would like to set up a CMS for exchanging, rating and tagging extractors, so you will be able to share your extractors with the world!
Of course if there are any issues, be sure to tell me about them! Thanks in advance.
February 28th, 2007 at 12:41 pm
Good. I accidentally bumped mechanize to 0.6.5 and had some problems with the new examples. Then I uninstalled it and reinstalled 0.6.3, but the problem persists. See output for the google
Many of the other examples that I tried so far are working OK (with mechanize 0.6.3 and hpricot 0.5.0).
March 3rd, 2007 at 6:06 pm
I have skimmed through many web pages, how-tos and editorials. Scrubyt is a great-looking gem that does what I think I want it to do, with very understandable coding and explanation. That said, I have tried to install Ruby, RubyGems and all sorts of other known beasts to run Scrubyt. It is now day 4 of this fight with the Terminal and I am tired. Why am I getting the following error?
LoadError: no such file to load - scrubyt
method gem_original_require in custom_require.rb at line 27
method require in custom_require.rb at line 27
at top level in untitled.rb at line 7
I have loaded Scrubyt somewhere
thank you in advance
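One possible cause of a LoadError like this is that the gem was installed for a different Ruby than the one the editor or Terminal actually runs. The following is a small diagnostic sketch under that assumption. It uses the current RubyGems API (`Gem::Specification.find_all_by_name`), which differs from the RubyGems of 2007, and `gem_visible?` is a hypothetical helper name.

```ruby
require 'rubygems'
require 'rbconfig'

# Hypothetical helper: true if the running interpreter can see an installed
# gem with the given name.
def gem_visible?(name)
  !Gem::Specification.find_all_by_name(name).empty?
end

# Show which Ruby is running and where it looks for gems.
puts "Interpreter:      #{RbConfig.ruby}"
puts "Gem search paths: #{Gem.path.join(', ')}"
puts "scrubyt visible?  #{gem_visible?('scrubyt')}"
```

If this prints false when run from the same place as the failing script, comparing the interpreter path with the one used for `gem install` may show the mismatch.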
March 6th, 2007 at 3:41 am
staffan:
Sorry for the late reply, I have been busy recently.
Yes, this is a known bug which will be fixed in the upcoming release.
itay:
Is there somebody on a Mac with similar problems? Unfortunately I don’t own a Mac, so I have no idea what’s going on…
March 9th, 2007 at 8:45 am
Does Scrubyt support htaccess authentication? (And, if so, could you provide an example, please?)
March 9th, 2007 at 8:52 am
Zed,
This is not yet possible in 0.2.3, however it was already requested here:
http://rubyforge.org/tracker/index.php?func=detail&aid=9121&group_id=2836&atid=10927
and most probably it will be included in the next release, 0.2.5, which should be out around next weekend.
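For reference, htaccess protection normally means HTTP Basic authentication, and outside scRUBYt! it can be sketched with Ruby's standard Net::HTTP library. The URL and credentials below are placeholders and `basic_auth_request` is a hypothetical helper. The example only builds the request and does not send it.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: build (but do not send) a GET request that carries
# HTTP Basic credentials in its Authorization header.
def basic_auth_request(url, user, password)
  request = Net::HTTP::Get.new(URI.parse(url))
  request.basic_auth(user, password)
  request
end

# Placeholder URL and credentials, for illustration only.
request = basic_auth_request('http://example.com/protected/', 'user', 'secret')
puts request['Authorization']
```

To actually fetch the page, the request would be passed to `http.request` inside a `Net::HTTP.start(host, port)` block.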