Archive for September, 2007
A Hot New Release, 0.3.4 is Out - What’s New?
After a long, long time, a lot of bugfixes, brainstorming sessions, coding, coding, coding, cans of Red Bull and coding, we are proud to present scRUBYt! 0.3.4!
Judging from the posts on the forum, people are not aware of quite a lot of powerful features (which is mainly my fault, as I was too lazy to write any documentation for the last 2 releases - but a cheatsheet and a reference are on the way). To avoid this, at least for this release, I'd like to introduce a few new features which were added to scRUBYt! 0.3.4.
First of all there are 3 new pattern types, of which 2 are particularly interesting. Let’s start with the not-so-interesting one:
- Constant pattern:
Sometimes I needed a piece of text or data which was not contained in the web page (or it was always constant, so scraping it would mean unneeded overhead) - perhaps a comment or a required field in a feed or other predefined schema. The constant pattern comes in handy exactly in these cases. This example:
- pattern 'some constant text', :type => :constant
will produce:
- <pattern>some constant text</pattern>
The two interesting ones are in a very-alpha stage (in fact one of them was implemented 2 days ago for a scenario) so they are more of a preview of what to expect in the future releases than full-fledged features. They are already usable to some extent, but a lot of tweaking, polishing and adding new functionality can be expected in the near future.
- Text pattern:
A text pattern works differently from an XPath one: while the XPath pattern relies on the structure of the document, the text pattern doesn't. This is essential in the case of some sites (the most typical example is perhaps Wikipedia) which do not use a single template to present the content and/or whose structure changes often, but which contain some text labels or other constant text chunks that can aid the scraping. For example:
- pattern 'td[some text]:all', :type => :text
means: find all <td> tags which contain the text 'some text', anywhere on the page.
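To make the idea more concrete, here is a minimal plain-Ruby sketch of structure-independent matching. It is not how scRUBYt! implements the text pattern: `find_by_text` is a hypothetical helper, the sample HTML is made up, and the naive regex scan is only good enough for illustration (real pages need a proper HTML parser).

```ruby
# Hypothetical helper, NOT part of scRUBYt!: parses a 'tag[text]:selector'
# spec and returns the contents of matching tags, wherever they are nested.
def find_by_text(html, spec)
  m = spec.match(/\A(\w+)\[(.*)\]:(all|\d+)\z/)
  raise ArgumentError, "unsupported spec: #{spec}" unless m
  tag, text, selector = m[1], m[2], m[3]

  # Collect the inner content of every <tag>...</tag>, ignoring document structure.
  contents = html.scan(/<#{tag}\b[^>]*>(.*?)<\/#{tag}>/mi).flatten
  matches  = contents.select { |c| c.include?(text) }

  # ':all' returns every match, ':N' returns the N-th one (zero-based).
  selector == 'all' ? matches : matches[selector.to_i]
end

# Made-up sample page: the two matching cells sit at different depths.
html = <<~HTML
  <table>
    <tr><td>some text here</td><td>unrelated</td></tr>
  </table>
  <div><table><tr><td>more: some text</td></tr></table></div>
HTML

all_cells  = find_by_text(html, 'td[some text]:all')
first_cell = find_by_text(html, 'td[some text]:0')
```

Note that neither call says anything about where the cells live in the page, which is the whole point of the text pattern.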
I am sure you noticed the :all notation. Besides :all, :index is currently supported (where index is a number, so :0, :1 etc.), meaning 'give me the first (:0), second (:1) etc. occurrence of the match'. A lot of additions can be expected for the text pattern in the future (for example, give me the longest text in a <td>, or give me the <td>s matching a certain regexp). As always, suggestions are warmly welcome!
- Script pattern:
A script pattern is a way to execute an arbitrary Ruby block during scraping. Its input, as always, is the output of its parent pattern, represented by 'x'. This pattern type will be enhanced a lot in the future: choosing more, arbitrary patterns as the input, the possibility of specifying custom input, a simpler syntax (the 'lambda {|x| }' stuff is constant, so it will most probably be dropped) etc. Still, it is already quite powerful as it is. The simplest use cases include filtering and modifying URLs, stripping white space or other string modifications like substitutions on the result, primitive branching etc. However, only your imagination is the limit here: you could do different operations on scraped prices or stock data, or run scraped coordinates through a geocoder. I am quite sure that the script pattern will be a lot of fun, resulting in interesting uses.
- pattern lambda {|x| x.gsub('x','y').downcase}
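Here are a few of the use cases above sketched in plain Ruby, outside of any extractor. The lambda names and sample strings are made up for illustration; inside scRUBYt! the input 'x' would be the parent pattern's result rather than a literal.

```ruby
require 'uri'

# The substitution from the example above.
substitute = lambda { |x| x.gsub('x', 'y').downcase }

# Stripping white space from a scraped string.
strip_space = lambda { |x| x.strip }

# Turning a scraped price string into a number.
price_to_f = lambda { |x| x.strip.delete('$,').to_f }

# Modifying a URL: drop the query string.
drop_query = lambda do |x|
  uri = URI.parse(x)
  uri.query = nil
  uri.to_s
end

# Primitive branching: leave absolute URLs alone, prefix relative ones.
# 'http://example.com' is a placeholder base address.
absolutize = lambda { |x| x.start_with?('http') ? x : "http://example.com#{x}" }

substituted = substitute.call('Box of Wax')
stripped    = strip_space.call("  hello \n")
price       = price_to_f.call('  $1,299.00 ')
clean_url   = drop_query.call('http://example.com/item?id=7&session=abc')
full_url    = absolutize.call('/about')
```

Any of these lambdas could serve as the body of a script pattern, since each takes a single string and returns the transformed result.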
There are some additions to the output functionality: to_hash now accepts a custom delimiter (for the cases when the output contained a comma, the default delimiter) and there is a new method: to_flat_xml, which produces a feed-like, flat xml instead of the hierarchical output generated by to_xml.
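Why would you want a custom delimiter? Here is a small plain-Ruby sketch of the problem, under the assumption that several scraped values end up joined into one string with the delimiter. `join_values` is a hypothetical stand-in, not the actual to_hash implementation, and the sample data is made up.

```ruby
# Hypothetical stand-in for joining several scraped values under one key.
def join_values(values, delimiter = ',')
  values.join(delimiter)
end

# Made-up scraped values that themselves contain commas.
authors = ['Knuth, Donald', 'Ritchie, Dennis']

ambiguous = join_values(authors)       # commas in the data collide with the delimiter
safe      = join_values(authors, '|')  # a custom delimiter keeps values separable

pieces_with_comma = ambiguous.split(',')
pieces_with_pipe  = safe.split('|')
```

With the default comma the two values can no longer be told apart from four fragments; with '|' splitting gives back exactly the original two.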
Logging was reworked completely by Tim Fletcher. The most notable difference is that, by default, you won't be flooded with all the debug messages pouring from scRUBYt!. To enable logging, you have to explicitly add a line before your extractor:
- Scrubyt.logger = Scrubyt::Logger.new
- #your extractor begins here
Last but not least, a lot of bugs were fixed: the infamous regexp pattern bug, the encoding bug (scraping utf-8 pages should be ok now), a lot of fixes in the download pattern and other places.
jscRUBYt! and firescRUBYt! are on the way, so stay tuned!
Let the Dogs Out!
scRUBYt! has dug its way into the investment business :). No joke - check out this scrummy tutorial created by Doug Bromley to find out more.
A refreshing mix of business and technology is served here: a nice scraper that returns dividend yields summarized all in one place, and a useful example of form filling and submitting, page navigation and constraints.
Thanks, Doug, for the excellent job!