Archive for February, 2007
Introduction to Constraints
The two most typical problems with a set of rules are that they match either fewer or more instances than you would like. Constraints are a way to remedy the second problem: they serve as a tool to filter out some result instances based on rules.
Constraints are nearly always applied after you run the extractor, observe the result and find out that it extracted more results than needed. Let me demonstrate this on an example (the example can be found among the standard examples in the buydig folder):
Let’s construct a simple extractor which scrapes the name and price of every camera!
camera_data = Scrubyt::Extractor.define do
  fetch File.join(File.dirname(__FILE__), "input.html")
  item do
    item_name "Canon Vertical Battery Grip BG-E3 For EOS Digital Rebel XT"
    price "$179.00"
  end
end
That was easy - or was it? Let’s examine the output:
<root>
<item>
<item_name></item_name>
</item>
<item>
<item_name></item_name>
</item>
<item>
<item_name>Canon Vertical Battery Grip BG-E3 For EOS Digital Rebel XT</item_name>
<price>$179.00</price>
</item>
<item>
<item_name>Canon Vertical Battery Grip BG-E4 For EOS 5D</item_name>
<price>$249.00</price>
</item>
<!-- ...
22 items omitted
... -->
<item>
<item_name>Canon EOS Digital Rebel XT Body (Black) - EOS 350D</item_name>
<price>$696.00</price>
</item>
<item>
<item_name>Shopping Cart</item_name>
</item>
<item>
<item_name></item_name>
</item>
<item>
<item_name></item_name>
</item>
<item>
<item_name></item_name>
</item>
<item>
<item_name></item_name>
</item>
<item>
<item_name></item_name>
</item>
</root>
item extracted 33 instances.
item_name extracted 33 instances.
price extracted 25 instances.
Well, something is not quite right here. There are 25 records on the page that we are interested in, but unfortunately there are more objects on the same XPath as our records; those were extracted as well and are littering our otherwise nice output. How do we remove them?
As mentioned previously, if an extractor returns too many results, we have to apply constraints: additional rules showing the system which results should be thrown out. There are currently only a few constraints implemented in scRUBYt!, but even among these there are two types that can solve our problem.
First let’s check the easier one (the one you would actually use in such cases; we will look into the other one for illustration purposes): ensure_presence_of_pattern. By observing the output, it is clear that we need only those <item>s which have a <price>. The ensure_presence_of_pattern constraint, as its name says, does exactly this: it returns only those elements which have a specific child pattern. Let’s see the modified example:
camera_data = Scrubyt::Extractor.define do
  fetch File.join(File.dirname(__FILE__), "input.html")
  item do
    item_name "Canon Vertical Battery Grip BG-E3 For EOS Digital Rebel XT"
    price "$179.00"
  end.ensure_presence_of_pattern 'price'
end
If you observe the result now, you can see that it extracts only the correct items.
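To make the idea of this constraint concrete, here is a minimal plain-Ruby sketch of the filtering step. It does not use scRUBYt! at all: the `items` array of Hashes is a hypothetical stand-in for the extracted results, and the method below only borrows the constraint's name for illustration, so treat it as a picture of the concept rather than the library's implementation.

```ruby
# Hypothetical stand-in for extracted results: each item is a Hash of
# child pattern name => extracted text (not scRUBYt!'s real data model).
items = [
  { "item_name" => "" },
  { "item_name" => "Canon Vertical Battery Grip BG-E3 For EOS Digital Rebel XT",
    "price" => "$179.00" },
  { "item_name" => "Canon Vertical Battery Grip BG-E4 For EOS 5D",
    "price" => "$249.00" },
  { "item_name" => "Shopping Cart" }
]

# Keep only those results which contain the named child pattern.
# (Illustrative helper; it merely shares its name with the constraint.)
def ensure_presence_of_pattern(results, pattern_name)
  results.select { |result| result.key?(pattern_name) }
end

filtered = ensure_presence_of_pattern(items, "price")
filtered.each { |item| puts "#{item['item_name']}: #{item['price']}" }
```

The empty items and the 'Shopping Cart' entry have no price, so they are dropped, which mirrors what happens to the extractor output above.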
The second type of constraint is called ensure_presence_of_ancestor_node. Its purpose is to accept only those results which have an ancestor HTML node with a given name and set of attributes. For example, if a pattern extracts a <tr>, and you add an ensure_presence_of_ancestor_node constraint to it with the values :td (the suggested ancestor HTML node) and 'colspan' => '3' (the attribute which has to be present), only those table rows will be returned which contain a <td> ancestor with the attribute 'colspan', where the value of 'colspan' is '3'. This may sound complicated at first, but it is a super-easy concept once you get used to it.
So how do we apply this constraint to our cause? We can observe from the statistics that the prices were extracted correctly. So let’s check that HTML element in XPather. Open XPather and click e.g. the first price ($179.00) on the page. You should see something like this:
You can observe that the price is inside an element which has an attribute ‘class’ with the value ‘searchProductPrice’. Therefore add an ensure_presence_of_ancestor_node constraint to the pattern ‘item’:
Since the ‘price’ pattern is an ancestor of the ‘item’ pattern, their HTML input chunks have to be in the same relation (i.e. the HTML input of ‘price’ is an ancestor of ‘item’), so we can tell ‘item’ that we need only those result instances which have a ‘price’ (which, translated to HTML inputs, means that only those should be extracted which have an ancestor with the attribute name ‘class’ and the attribute value ‘searchProductPrice’).
ensure_presence_of_ancestor_node has a negative counterpart, ensure_absence_of_ancestor_node, which rejects (rather than accepts) results with such properties.
The last type of constraint currently supported is ensure_presence_of_attribute (which again has an ensure_absence version). Its parameters are “attribute_name” and “attribute_value”. If this type of constraint is added to a pattern, the HTML node it targets must have an attribute named “attribute_name” with the value “attribute_value”.
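The presence and absence variants of the attribute constraint can be pictured with a small plain-Ruby sketch. The node representation (a Hash with :name and :attributes keys) and the helper methods are assumptions made for illustration; they only borrow the constraint names and are not scRUBYt!'s own code.

```ruby
# Hypothetical node model: a Hash with a tag name and a Hash of attributes.
nodes = [
  { name: "td", attributes: { "class" => "searchProductPrice" } },
  { name: "td", attributes: { "class" => "footer" } },
  { name: "td", attributes: {} }
]

# True if the node has the attribute and it carries exactly this value.
def attribute_matches?(node, attribute_name, attribute_value)
  node[:attributes][attribute_name] == attribute_value
end

# Presence: accept only the nodes which have the attribute/value pair.
def ensure_presence_of_attribute(nodes, attribute_name, attribute_value)
  nodes.select { |n| attribute_matches?(n, attribute_name, attribute_value) }
end

# Absence: reject the nodes which have the attribute/value pair.
def ensure_absence_of_attribute(nodes, attribute_name, attribute_value)
  nodes.reject { |n| attribute_matches?(n, attribute_name, attribute_value) }
end

kept     = ensure_presence_of_attribute(nodes, "class", "searchProductPrice")
rejected = ensure_absence_of_attribute(nodes, "class", "searchProductPrice")
puts "presence keeps #{kept.size} node(s), absence keeps #{rejected.size} node(s)"
```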
As a means of rejecting unneeded results, the concept of constraints is quite powerful - however, many more constraints will have to be implemented to really leverage its power.
The new release, 0.2.3 is out!
Thanks to the feedback from all of you, I managed to find a lot of bugs as well as write up a nice feature request list. The bugs are mostly fixed and also some shiny new features have been added. Stability was also improved by adding new tests and totally refactoring the whole code.
The new features make this release much more powerful than the previous one. Sites requiring login, submitting forms with a button click, filling text areas, dealing with variable-size results, smart handling of attribute lookup, https, custom proxy setting and tons of bugfixes make this release capable of doing much more than was possible in 0.2.0.
I have also added some shiny new examples: scraping reddit, del.icio.us, rubyforge login and wordpress automatic commenting, for example. Full changelog:
* [FIX] Cookies (and other stuff) are now taken into consideration
* [NEW] select_indices feature. Example:
    table do
      (row '1').select_indices(:last)
    end
this will select only the last row; possibility to specify a Range, or an array of indices,
or other constants like :first, :every_odd etc. More to come in the future!
* [FIX] digg.com next page problem fixed
* [FIX] Fetching of https sites
* [FIX] Next page works incorrectly when given an absolute path
* [FIX] Fixing exporting if the pattern parameters are parenthesized
* [NEW] Possibility to submit forms by clicking a button
* [NEW] Added new unit test suite: pattern_test
* [NEW] Possibility to set a proxy for fetching the input document
* [NEW] Added possibility to choose an option from a selection list
(Credit: Zaheed Haque)
* [FIX] Image pattern example lookup fix
* [NEW] Possibility to prefilter the document before passing it to Hpricot
(Credit: Demitrious Kelly)
* [FIX] corrected gem dependencies (Credit: Tim Fletcher)
* [FIX] remove duplicates only if there are more examples present
* [NEW] new examples: wordpress comment (Credit: Zaheed Haque), rubyforge login,
del.icio.us, reddit and more
* [FIX] if there is no scraper defined, exit with a message rather than raise an exception
* [NEW] smart handling of attribute lookup: try to look up the attribute in the parent, but
if it is not there, traverse up until it is found (this is useful e.g. if an image is
inside a span and the span is inside an <a>)
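The last changelog entry (smart handling of attribute lookup) describes a walk up the tree. A rough plain-Ruby sketch of that idea follows; the Hash-based node model and the `lookup_attribute` helper are hypothetical names chosen for illustration, and the real implementation works on Hpricot elements, so this only shows the traversal logic as I read it from the entry.

```ruby
# Hypothetical node model: Hashes with :name, :attributes and :parent keys.
# This mirrors the changelog's case: an image inside a span inside an <a>.
a    = { name: "a",    attributes: { "href" => "http://example.com/item" }, parent: nil }
span = { name: "span", attributes: {}, parent: a }
img  = { name: "img",  attributes: { "src" => "thumb.png" }, parent: span }

# Look for the attribute on the given node first; if it is not there,
# walk up through the ancestors until it is found (or the root is passed).
def lookup_attribute(node, attribute_name)
  current = node
  while current
    value = current[:attributes][attribute_name]
    return value if value
    current = current[:parent]
  end
  nil
end

puts lookup_attribute(img, "href")  # found two levels up, on the <a>
```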
I will also write new tutorials (for the new features as well), since they were requested by several people.
Please keep the feedback coming! Thanks to everyone for contributions, suggestions and every kind of communication.
Problems with scRUBYt! on Ubuntu
I have been using Ubuntu since Warty Warthog, and ever since I installed it for the first time, it has been my distro of choice. I could fill a whole blog with posts on why I like Ubuntu - however, a post about painless installation of different development tools wouldn’t be one of the topics there. As much as I like Ubuntu, I have to admit that it is definitely not a distribution geared towards development. I remember that when I tried Rails for the first time, I had to compile something and it took 2 days to find out that Breezy does not have make by default!
OK, this is not a blog post about development woes on Ubuntu in general - I would like to restrict myself to the problems which are scRUBYt! related. Thanks to the guys who reported these problems btw. (Dboy, Jason Evans, Richard Musiol - sorry if I left out someone).
The problem nearly always manifested in the message:
uninitialized constant Scrubyt::Extractor (NameError)
So far, there were 2 solutions that helped: one was to install rdoc and ri, since installing Ruby does not automatically install these packages. The other one is Mechanize related: you need to install libopenssl-ruby.
If you find any other problems, please report them here (add them as a comment and I will update the post). Thanks! Happy scrubbing!
Yummy cookies coming soon!
Judging from your mails and bug reports, the biggest missing thing at the moment is that cookies are lost during navigation (which means, for example, that after logging in to a site, you cannot scrape the page). Several people have been asking if this is the intended behavior, and if not, how easily it can be fixed.
Of course this is *not* the intended behavior, but a bug in the present version which will be fixed in the next release. The problem at the moment is that before the scraping, Hpricot loads the page through open-uri (rather than taking it over from Mechanize) and obviously open-uri does not know anything about Mechanize’s cookies.
Fetching every page with Mechanize should automatically fix the https problems as well. I am planning to introduce new options, like the possibility to set a custom proxy, a custom user-agent string and more, to further tweak the page loading.
Please note that even these improvements will still not make it possible to log in to javascript sites like google analytics. I will need to add FireWatir to scRUBYt! for that. This is also on the TODO list, but not for the next release.
Two New Tutorials Added
I have added two new tutorials which should eliminate a lot of confusion and make you an even better scRUBYt! hacker.
scRUBYt! development continues at full steam - thanks for all the feedback, I really appreciate it! The next minor release, 0.2.1, is on its way, with nearly all the bug fix requests I have received to date, and also some new and shiny features and enhancements (which will make scraping del.icio.us and reddit possible - with 0.2.0 they are very cumbersome to scrape, so I have added a new feature which should remedy this situation).
I am also continuing on with the tutorials. With 0.2.1, some new ones should arrive, and I am also finishing the original ones.
Thanks for all the feedback everyone! I am really impressed by all the communication, bug reports, enhancement requests and questions, which prove that you are really using scRUBYt!, and I am very happy about this. Please check back often or subscribe to scRUBYt!’s feed in order to keep up with the development - I will always publish any important info, news, development status etc. here.
Using Different Types of Examples
scRUBYt! works based on the examples specified by the user. Every pattern has one or more examples to specify what should be extracted by it. You can specify different types of examples, not just strings found on the page - though you will use those most of the time, so this is the default example type.
There are 2 basic types of patterns: tree and string. A tree pattern evaluates to an HTML region (an HTML element which can have more child elements etc. - so an HTML tree, hence its name). A string pattern evaluates to a string.
Do not confuse the pattern type with the example type. There are just two types of patterns, but 6 types of examples. 4 of them will create a tree pattern and 2 of them a string pattern.
The example types are so far (in the parentheses, the pattern type is shown):
- String from the page (tree) - You have mostly seen this in action, so probably when you think about an ‘example’, this type of example springs to mind. However, specifying this type is by far the trickiest, and a handful of things have to be kept in mind - see the ‘Example Specification from the Page - Known Issues and Pitfalls’ page for this.
Though this example type is the most commonly used, the other ones can also come in handy:
XPath (tree) - if you would like to extract something with XPaths on your own rather than leave it to the system to figure it out from a page example, you can use an XPath example. An XPath example should always begin with a slash ‘/’. There is no need to specify the example type explicitly here - scRUBYt! will figure this out automatically. In an exported extractor, all the ‘string from the page’ examples are replaced with XPath examples, so check out any exported extractor (which originally contained at least one ‘string from the page’ example) to see a concrete example.
Attribute (string) - Extracting attributes of the parent pattern. If the parent pattern is a tree pattern, you can extract attributes from it. Let’s revisit the google example, extracting also the URLs of the result pages this time:
require 'rubygems'
require 'scrubyt'
google_data = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/ncr'
  fill_textfield 'q', 'ruby'
  submit
  link "Ruby Programming Language" do
    url "href", :type => :attribute
  end
end
google_data.to_xml.write($stdout, 1)
If you would like to try this example, download it from here.
I think this is pretty straightforward again: we have instructed scRUBYt! that the ‘url’ pattern should extract the href attribute of the ‘link’ pattern.
Image (tree) - If you want to scrape an image tag (perhaps to further extract its dimensions, alternate text, or another attribute by specifying an attribute example type), just specify its src attribute as an example. This can be easily acquired from your browser: just right click on the image and choose ‘copy image location’ or similar, provided the page does not use relative URLs for the images; if it does, you had better look it up in the source. Unless something goes really awry, there is no need to specify the type explicitly - scRUBYt! will figure it out based on the image extension. If your image has some very exotic extension (or has no extension at all), use :type => :image.
To see an image pattern in action, check out the us1camera example in the official example set of 0.2.0!
Regular expression (string) - Take the parent pattern’s textual content, and scan() the result of that pattern with the regular expression provided. Again, let’s see an example:
table do
  row :generalize => true do
    cell '1, 2, 3', :generalize => true do
      numbers /\d+/
    end
  end
end
(The full example can be found in the examples - it’s the misc/tables/anotherplaintable example).
:generalize will be covered later (for the curious, it means: extract all the rows of a table, not just the concrete one defined by the example). The interesting part here is the ‘numbers’ pattern. For the input '1, 2, 3' it will generate an output like this:
<cell> <numbers>1</numbers> <numbers>2</numbers> <numbers>3</numbers> </cell>
- No example (tree) - In this case the system knows that this pattern is used for grouping its child patterns together (for example an address groups together Street, Number, City, ZIP, country etc., or a book has a title, author, ISBN) and generates the rule based on the child patterns.
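The relationship between the six example types and the two pattern types can be summed up in a short plain-Ruby sketch. Everything here is an assumption for illustration: the method names, the :xpath, :regexp, :string and :no_example symbols, and the list of image extensions are made up (only :attribute and :image appear as :type values in the text above), and scRUBYt!'s actual detection logic may well differ.

```ruby
# Guess the example type from the example itself, unless :type is given.
# Symbols other than :attribute and :image are invented for this sketch.
def guess_example_type(example = nil, options = {})
  return options[:type] if options[:type]       # explicit, e.g. :type => :attribute
  return :no_example if example.nil?            # pattern only groups its children
  return :regexp if example.is_a?(Regexp)
  return :xpath if example.start_with?("/")     # XPath examples begin with a slash
  return :image if example =~ /\.(gif|jpe?g|png)\z/i  # assumed extension list
  :string                                       # default: string from the page
end

# 4 example types create a tree pattern, 2 create a string pattern.
def pattern_type_for(example_type)
  [:attribute, :regexp].include?(example_type) ? :string : :tree
end

examples = [
  ["Ruby Programming Language", {}],
  ["//div/a", {}],
  ["href", { :type => :attribute }],
  ["http://example.com/camera.jpg", {}],
  [/\d+/, {}],
  [nil, {}]
]

examples.each do |example, options|
  type = guess_example_type(example, options)
  puts "#{example.inspect} => #{type} example, #{pattern_type_for(type)} pattern"
end
```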
I guess this tutorial was a bit heavy, but don’t worry if not everything was clear - you will understand everything when creating extractors, and you can always refer back
Example Specification from the Page - Known Issues and Pitfalls
At the moment, when providing examples from the page, you must specify the whole text content of a node. This sometimes causes trouble in the present version (most of the time it does not; these are more like hints and tips in case something is not working out):
- The rendered element and its actual source code may be quite different: the text of the element in the page source may be split up between 2 elements and still look like 1 element on the page, may be formatted with a lot of whitespace which is rendered differently in the browser, or may be mixed up with other elements, images, whatever. If scRUBYt! stops with a ‘FATAL: Node for example #{text} Not found!’ message, and you think that your text is there, use XPather to find out what’s happening: click on the node, then observe its textual content.
Let’s see an example, this time from amazon:

It looks as if the textual content of the element next to ‘You save’ would be “$17.00 (34%)”. The problem is that this is not really so. Check it out in XPather by clicking on the node and observing its text content:

As you can see, there is a lot of additional whitespace which is invisible on the page - however, currently it fools the system, because in the source code it is different, as we have seen in XPather.
In future versions, problems like this will be solved by adding the possibility of specifying an example with XPaths, through containment and other rules (see the conclusion at the end).
- If you are grouping together several examples (for example an item name and a price into a web shop item), you have to be sure all the examples are the first occurrences on the page. An example is worth a thousand words:
Let’s say you would like to extract the title of the article and the number of diggs, so the pattern structure will look something like:
article do
  title
  diggs
end
So, you can specify these example pairs:
- Data Extraction for Web 2.0: Screen Scraping in Ruby/Rails, Episode 1 as an example of ‘title’ and 22 diggs as an example of ‘diggs’ - because ‘22 diggs’ is the first occurrence of that string on the page
- Data Extraction for Web 2.0: Screen Scraping in Ruby/Rails as an example of ‘title’ and 35 diggs as an example of ‘diggs’ - because ‘35 diggs’ is the only occurrence of that string on the page
but not
- Get more data comparison options in MySQL with operators you may not know as an example of ‘title’ and 22 diggs as an example of ‘diggs’ - because ‘22 diggs’ is NOT the first occurrence of that string on the page
So, be sure to choose the first occurrence of the string on the page as an example!
- Always make sure that all the examples exist at the very moment when you are launching the learning extractor!
I could mention the digg example again: I was constructing an extractor on digg, and I could not get it to work for all the tea in China. After some minutes of banging my head against the wall I noticed that the problem was rather mundane (i.e. not hidden in the depths of metaprogramming logic or a similar cool place): I had specified the number of diggs and launched the extractor, but meanwhile somebody dugg the article, so the digg count example was no longer valid. It was very similar with ebay - the price examples were ‘corrupted’ there very fast because of bidding.
You can work around these cases in the following way:
- taking a snapshot of the page (by saving it from your browser, or if the page is really Ajaxy and all that jazz, you could try a Firefox plugin like this one)
- picking an example from a page that is not likely to change (e.g. digg - go to the 10th page and choose an example here, or ebay - choose an item which has a lot of time to go)
- Mixed content: I have received a problematic case from a scRUBYt! user. The problem was that the example content was mixed with an image:
TODO:: Snapshots to come… Possible solution: an additional possibility of specifying the example in a more sophisticated way, for example:
:begins_with => 'Landed', :ends_with => 'Successfully'
Conclusion: In the future, I would like to beef up example selection with multiple rules instead of a single string. I can imagine something like:
airport_record [:begins_with => 'Landed', :ends_with => 'Successfully', :matches_regexp => /\w+\d/]
etc.
The rule of thumb is: observe, observe, observe everything with XPather, and if you are still sure there is a problem, file a bug report or add an enhancement request at scRUBYt!’s rubyforge tracker.
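The whitespace pitfall from the amazon example above is easy to reproduce in plain Ruby. The sketch below is only a diagnostic aid: the exact whitespace in `source_text` is invented (the real page's whitespace may differ), and `normalize_whitespace` is a hypothetical helper, not something scRUBYt! does for you in the present version.

```ruby
# Text as it might appear in the page source; the whitespace here is
# made up, it just stands for 'a lot of additional invisible whitespace'.
source_text = "\n      $17.00\n      (34%)\n    "

# The example as you would read it off the rendered page.
example = "$17.00 (34%)"

# Hypothetical helper: trim the ends and collapse whitespace runs.
def normalize_whitespace(text)
  text.strip.gsub(/\s+/, " ")
end

puts source_text == example                        # false: exact match fails
puts normalize_whitespace(source_text) == example  # true: same visible text
```

If the first comparison fails but the second succeeds, the example is on the page but differs from the source only in whitespace, so copying the node's exact text content (as shown by XPather) is the thing to try.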
Important: Exporting Extractors!!!
I have received about 4 mails and comments just today with this problem, and also found it in a comment on Peter Cooper’s blog entry where he announced scRUBYt!:
This is rather impressive. The end result could just as well be done with Curl, but this way, it is a lot clearer to understand in the source. On the downside, this script will stop working when pricing changes or the item does not show on the first page any more.
Fortunately, this is not true. I am going to update the tutorial with this important info as soon as I have a minute - at the moment I have time just for this quick message.
Until then, just quickly:
The examples are used just to learn the rules of extraction - after this they should be discarded!
Suppose you have an extractor named my_extractor. Then simply do this:
my_extractor.export(__FILE__)
Check the directory from which you launched the script, and voila! There should be a file named
my_extractor_exported.rb
That should be it. All the examples are gone, and the page can change as much as it wants, the extractor will happily work!
More on this in one of the next tutorials which are coming soon.
More feedback required
Just 2 days after its first public release, scRUBYt! got a lot of attention: it was featured on rubyinside.com and other blogs, climbed to the first place at del.icio.us/popular/ruby and even entered del.icio.us/popular (currently bookmarked by 134 people), and who knows what else. I am really happy about this, to say the least!
What I am not really happy about is the feedback (yet) - O.K. maybe I am a bit impatient and it will come anyway as more and more people will use and actually like scRUBYt!. I am sure that you found bugs, missing features (some of the most pressing ones were already reported in the comments). I would definitely like to hear these from you so I can make web scraping with scRUBYt! an even better experience. There are more ways to communicate:
- submitting bug reports and feature requests: please use the rubyforge tracker
- commenting on any of the posts
- sending a mail: scrubyt@<no|spam>scrubyt.org
The other way of contributing is sending your extractors so I can put them on the site - in the future, I would like to set up a CMS just for this purpose, but until then, they will be posted to a static page.
Thanks for your support and encouragement!
Social stuff
Automatic wordpress comment by Zaheed Haque:
require 'rubygems'
require 'scrubyt'
post_comment = Scrubyt::Extractor.define do
fetch 'http://scrubyt.org/your-first-extractor/'
fill_textfield 'author', 'Zaheed Haque'
fill_textfield 'email', 'zaheed.haque@gmail.com'
fill_textarea 'comment', 'Posting a comment using my first
scRUBYt script. Lets see if it works I will publish the code snippet
here'
submit
end