Archive for the ‘News & Announcements’ Category
scRUBYt! gem on Github
I have finally fixed the scrubyt.gemspec and committed it to github - so you can install scRUBYt! from the github gem, which I will update quite often ('real' releases to rubyforge, with announcements etc., will happen much less frequently). So in case you would like to keep up with the newest stuff, get the latest bugfixes and whatnot, be sure to follow scRUBYt! on github and install the newest gem.
You can do so by running the following (if you haven’t already):
- gem sources -a http://gems.github.com
and installing scRUBYt! with
- sudo gem install scrubber-scrubyt
If you do so right now, you will get version 0.4.11, which contains a number of bug fixes, so be sure to check it out! More goodies to come soon.
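Since 0.4.11 and 0.4.1 are easy to mix up, here is a small sketch for comparing version numbers from Ruby. It relies only on RubyGems itself; the helper name `newer?` is made up for illustration, and the gem name is the one used in the install command above.

```ruby
require 'rubygems'

# Hypothetical helper: true if version string a is newer than version string b.
def newer?(a, b)
  Gem::Version.new(a) > Gem::Version.new(b)
end

puts newer?('0.4.11', '0.4.1')   # prints "true": 0.4.11 is the later release

# List the versions of the github gem that RubyGems can find locally
# (an empty list simply means the gem is not installed yet).
specs = Gem::Specification.find_all_by_name('scrubber-scrubyt')
puts specs.map { |spec| spec.version.to_s }.inspect
```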
Bye-bye Beast, Hello googlegroups
I am sure most of you noticed that the forums have been down for some time - I put quite some energy into fixing the issue, but it's a very old install (using the archaic fcgi way) of the old beast, and I had no time to convert it to nginx/mongrel or phusion (besides the fact that beast is abandonware).
To make a long story short: I have created a google groups mailing list for scRUBYt! - please subscribe and place your questions there. It's not very likely the forums will be back (though it would be great to share all the info that piled up there in some way - will think about it) - a mailing list is easier for everyone.
At last: scRUBYt! 0.4.1 is out
After more than a year, I’d like to announce a new release of scRUBYt! and set “scRUBYt!”.is_vaporware? = false. w00t!
Thanks to Glenn Gillen, it is now possible to use FireWatir as the agent for navigation, enabling AJAX/more robust scraping via Firefox/FireWatir.
Another piece of big news is that the RubyInline, ParseTree and Ruby2Ruby dependencies were dropped, since we couldn't solve this problem for win32 for a whole year. Yay for the windows users (and other OS users juggling various versions of the above stuff).
Of course a lot of bugs were fixed as well.
On the non-source code front, we have
- An all new homepage with all the useful links
- A github repository
- A small (at the moment only!) scraper repository
- A Lighthouse tracker
- TextMate bundle
and probably other cool stuff which I can’t remember right now! Will update the article later.
What’s next?
The biggest news is that scRUBYt! is going to be rewritten from scratch - the work has already been started by Glenn Gillen. scRUBYt! has grown too big for our taste, so we decided to start anew, aiming for 100% rSpec coverage, refactored code, speed/performance optimization and leaving all the cruft behind. So scRUBYt! 0.4.1, the last one based on the original scRUBYt!, will be supported until the new, rewritten one (0.5.0) comes out and takes its place.
TextMate Bundle for scRUBYt!
As stupid as this sounds from the original author after countless hours of scRUBYt! usage and development, I still had to occasionally open some older scrapers to get the exact logger, exporter, click_link_and_wait etc. syntax. Even though I know 95% of the possible commands, I thought it'd be great to speed up the typing time - a typical scRUBYt! extractor has tons of boilerplate code.
So I decided to create a TextMate bundle and host it on github. It's pretty rudimentary right now, consisting of about two dozen snippets, but hey, it's a start.
I bet scgoog->TAB will become a big favorite right away (spits the classical google extractor example into your editor) - but there are other usable snippets included as well. With their help it’s literally possible to create a scraper in a few seconds.
If you have further ideas, would like to contribute etc. please drop me a mail (scrubyt -nice try spambot! NOT.- at scrubyt dot org).
scRUBYt! on GitHub and LightHouse
The release is almost ready - I have to finish one more important feature: printing the generated XPaths for patterns which were specified by an example. Extractor#export() is not supported at the moment, since RubyInline, ParseTree and ruby2ruby were dropped to make the installation as smooth as possible :-). Until it is added back, you can at least substitute the examples manually - not a great solution, but a working one. I'll also test the whole stuff once again, fix some minor bugs and create some nice examples (if you have suggestions, drop me a comment and let's see what I can do) - I guess this should be done by the weekend.
Until then, if you'd like to see the present state, check out scRUBYt! on github. I have also just set up a LightHouse tracker for scRUBYt! - if you find any bugs or have feature requests/ideas etc., just drop them there.
Installing FireWatir for Firefox 3.0+
After nearly a year of silence, a new version, firescRUBYt! (scRUBYt! integrated with FireWatir), is ready for release. In fact, firescRUBYt! has been ready for a few months already - the problem was that FireWatir [1] was not ready for Firefox 3 (which I believe everyone has been using for some time now), and forcing users to go back to FF2 just to try out the new release would not have been a good move, I guess :-).
To make a long story short: the release is coming in the next few days; until then, get ready by installing FireWatir (and jssh, its prerequisite). So let's start with jssh.
Check out the files attached to the FireWatir project here (in case the link is broken, go to the FireWatir site and navigate from there). Select your Firefox version and OS (for example, take this xpi if you are on OS X and using FF >3.0 - see all the combos in part 2 of the official installation guide) and install jssh, which is an xpi file (unless it is opened automatically, open it with FF via File -> Open). Restart Firefox after the installation to ensure that the add-on is activated.
To test whether jssh is working, close your current Firefox instance, go to the command line and start FF with the -jssh option, i.e.:
- firefox -jssh
In a separate window, try to connect to Firefox via jssh with telnet:
- telnet 127.0.0.1 9997
You should see/try something similar:
- Macintosh-4:~ mbp$ telnet 127.0.0.1 9997
- Trying 127.0.0.1...
- Connected to localhost.
- Escape character is '^]'.
- Welcome to the Mozilla JavaScript Shell!
- > 'Hello, world!'
- Hello, world!
- > exit()
- Goodbye!
- Connection closed by foreign host.
Which means jssh is properly installed! You are through with the hardest part. Now you need to install the firewatir gem:
- gem install firewatir
That’s it! - you are ready to roll with FireWatir! The official release of scRUBYt!, along with a few tutorials is coming soon - stay tuned!
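The telnet check above can also be scripted. The following is a minimal sketch using only Ruby's standard library; it merely tests whether something accepts TCP connections on port 9997 (the port used in the telnet session above) and does not speak to jssh itself. The helper name `port_open?` and the 2-second timeout are my own choices for illustration.

```ruby
require 'socket'
require 'timeout'

# Hypothetical helper: true if something accepts TCP connections on host:port.
def port_open?(host, port, seconds = 2)
  Timeout.timeout(seconds) do
    TCPSocket.new(host, port).close
    true
  end
rescue SystemCallError, SocketError, Timeout::Error
  # Connection refused, unreachable host, bad host name or timeout
  false
end

if port_open?('127.0.0.1', 9997)
  puts 'Something is listening on port 9997 - jssh is probably up.'
else
  puts 'Nothing answers on port 9997 - was Firefox started with -jssh?'
end
```

If the second message shows up, make sure all other Firefox instances were closed before starting it with the -jssh option.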
My EURUKO 2007 slides
You can download my EURUKO (the European Ruby Conference) 2007 slides from here. Enjoy!
Announcing JscRUBYt! - no more win32 problems (?)
Thanks to Paul Nikitochkin a.k.a. pftg, scRUBYt! made a great leap towards win32 compatibility. Paul created JscRUBYt! - the JRuby version of scRUBYt!, which should be easy to install under win32 even if you are not a level 64 microsoft compiling ninja (in fact, it requires no compiling, fiddling around with C/C++ or doing anything outside (J)Ruby-land (well, except for installing JRuby, of course)).
Please download JscRUBYt! from here and read the installation instructions written up by Paul.
Please let us know if you run into any problems, and tell us about your experience with this package!
A Hot New Release, 0.3.4 is Out - What’s New?
After a long-long time, a lot of bugfixes, brainstorming sessions, coding, coding, coding, cans of red bull and coding, we are proud to present scRUBYt! 0.3.4!
Judging from the posts on the forum, people are not aware of quite a lot of powerful features (which is mainly my fault, as I was too lazy to write any documentation for the last 2 releases - but a cheatsheet and reference are on the way). To avoid this, at least for this release, I'd like to introduce a few new features which were added to scRUBYt! 0.3.4.
First of all there are 3 new pattern types, of which 2 are particularly interesting. Let’s start with the not-so-interesting one:
- Constant pattern:
Sometimes I needed a piece of text or data which was not contained in the web page (or it was always constant, so scraping it would mean unneeded overhead) - perhaps a comment or a required field in a feed or other predefined schema. The constant pattern comes in handy in exactly these cases. The first line below is the example, the second is what it will produce:
- pattern 'some constant text', :type => :constant
- <pattern>some constant text</pattern>
The two interesting ones are in a very-alpha stage (in fact one of them was implemented 2 days ago for a scenario) so they are more of a preview of what to expect in the future releases than full-fledged features. They are already usable to some extent, but a lot of tweaking, polishing and adding new functionality can be expected in the near future.
- Text pattern:
A text pattern works differently than an XPath one: while the XPath pattern relies on the structure of the document, the text pattern doesn't. This is essential in the case of some sites (the most typical example is perhaps wikipedia) which are not using a single template to present the content and/or whose structure changes often, but which have some text labels or other constant text chunks that can aid the scraping. The semantics of the following example are:
- pattern 'td[some text]:all', :type => :text
Find all <td> tags which contain the text 'some text', anywhere on the page.
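To make the notation concrete, here is a rough sketch that takes such a string apart in plain Ruby: a tag name, the text in square brackets, and a trailing selector which is either :all or a number. This is not how scRUBYt! parses it internally; the constant `TEXT_PATTERN` and the helper `parse_text_pattern` are hypothetical names used only for illustration.

```ruby
# Hypothetical regexp for the 'tag[text]:selector' notation.
TEXT_PATTERN = /\A(\w+)\[(.*)\]:(all|\d+)\z/

# Hypothetical helper: splits a text-pattern string into its three parts.
def parse_text_pattern(spec)
  match = TEXT_PATTERN.match(spec)
  raise ArgumentError, "unrecognized text pattern: #{spec}" unless match

  tag, text, selector = match.captures
  {
    :tag    => tag,
    :text   => text,
    # :all means every occurrence, a number means one 0-based occurrence
    :select => (selector == 'all' ? :all : selector.to_i)
  }
end

p parse_text_pattern('td[some text]:all')
p parse_text_pattern('td[some text]:1')
```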
I am sure you noticed the :all notation - currently :index (where index is a number, so :0, :1 etc.) is supported besides :all, meaning 'give me the first (:0), second (:1) etc. occurrence of the match'. A lot of additions can be expected for the text pattern in the future (for example, give me the longest text in a <td>, or give me <td>s matching a certain regexp etc.). As always, suggestions are warmly welcome!
- Script pattern:
A script pattern is a way to execute an arbitrary Ruby block during scraping. Its input, as always, is the output of its parent pattern, represented by 'x'. This pattern type will be enhanced a lot in the future: allowing you to choose more, arbitrary patterns as the input, the possibility of specifying custom input, a simpler syntax (the 'lambda {|x| }' stuff is constant, so it will most probably be dropped) etc. Even so, this pattern is already quite powerful as it is. The simplest use cases include filtering and modifying URLs, stripping white space or other string modifications like substitutions on the result, primitive branching etc. However, only your imagination is the limit here: you could do different operations on scraped prices or stock data, or run scraped coordinates through a geocoder. I am quite sure that the script pattern will be a lot of fun, resulting in interesting uses.
- pattern lambda {|x| x.gsub('x','y').downcase}
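The lambda itself is ordinary Ruby, so you can try it out in irb without scRUBYt! at all. The sketch below calls the lambda from the example on a made-up input string, plus a second, hypothetical lambda (`strip_query`) illustrating the 'modifying URLs' and 'stripping white space' use cases; the sample inputs are invented.

```ruby
# The block from the example above, applied by hand to a sample string.
transform = lambda { |x| x.gsub('x', 'y').downcase }
puts transform.call('XxX Example')   # prints "xyx eyample"

# Hypothetical second lambda: trim white space and drop the query string of a URL.
strip_query = lambda { |x| x.strip.sub(/\?.*\z/, '') }
puts strip_query.call("  http://example.com/page?id=42 \n")   # prints "http://example.com/page"
```

Note that gsub runs before downcase, so only the lowercase x characters of the input are replaced.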
There are some additions to the output functionality: to_hash now accepts a custom delimiter (for the cases when the output contained a comma, the default delimiter) and there is a new method: to_flat_xml, which produces a feed-like, flat xml instead of the hierarchical output generated by to_xml.
Logging was reworked completely by Tim Fletcher. The most notable difference is that by default, you won't be flooded with all the debug messages pouring from scRUBYt!. To enable logging, you have to explicitly add a line before your extractor:
- Scrubyt.logger = Scrubyt::Logger.new
- # your extractor begins here
Last but not least, a lot of bugs were fixed: the infamous regexp pattern bug, the encoding bug (scraping utf-8 pages should be ok now), a lot of fixes in the download pattern and other places.
jscRUBYt! and firescRUBYt! are on the way, so stay tuned!
Let the Dogs Out!
scRUBYt! has dug its way into the investment business :). No joke - check out this scrummy tutorial created by Doug Bromley to find out more.
A refreshing mix of business and technology is served here: a nice scraper that returns dividend yields summarized all in one place, and a useful example of form filling and submitting, page navigation and constraints.
Thanks, Doug, for the excellent job!