Given that I’m looking for a new apartment, and I live in Ireland, I use the property search website Daft.ie. Everyone does. However, I wasn’t very happy with how slow it is to scan through the many results that match my meagre budget. I realised that this could be readily fixed with GreaseMonkey, using the Dojo Ajax Toolkit to make life easier when it comes to parsing the page, adding effects, etc.
The result is DaftMonkey.
I wasn’t even sure if Dojo could be used from within a GreaseMonkey script, as GreaseMonkey sandboxes the custom script code away from the page. However, with a little hackery it was (more or less) possible. The steps I took were:
- Set up the djConfig parameter in the host window to tell Dojo that the page had already loaded, using unsafeWindow.djConfig = {afterOnLoad: true};. unsafeWindow is what GreaseMonkey calls the normal, non-sandboxed window.
- Add the <script> tag for dojo.js to the head of the document. In this case I used the dojo.js file hosted on AOL’s CDN servers (see http://dev.aol.com/dojo).
- Now you have to wait for Dojo to load. This can be done with a simple setInterval function call, checking if unsafeWindow.dojo exists or not. (Update: thanks to a comment from James, this has been changed to use the djConfig.addOnLoad function.)
- Once Dojo is loaded, you can call a function kicking off whatever it is that your script is supposed to do. In this case, I wanted to add a bunch of DOM nodes to the page (which you can do without Dojo), and add some cool effects, so I also included the dojo.fx bundle.
- Copy the dojo variable back into the sandbox window using var dojo = unsafeWindow.dojo, otherwise you’ll have to refer to it as unsafeWindow.dojo all the time.
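The steps above can be sketched roughly as follows. This is only an illustration, not the actual DaftMonkey code: loadDojo is a made-up helper name, and the window and document are passed in as parameters so that the same function works with unsafeWindow in GreaseMonkey. The address of dojo.js is left as a parameter too, since it depends on which hosted build you pick.

```javascript
// Hypothetical helper (not from DaftMonkey): injects dojo.js into the page.
// `win` is the page's real window (unsafeWindow in GreaseMonkey),
// `doc` is its document, `dojoUrl` is the address of the hosted dojo.js.
function loadDojo(win, doc, dojoUrl, onReady) {
  // Tell Dojo the page has already loaded, and ask it to call back
  // once it has finished initialising.
  win.djConfig = {
    afterOnLoad: true,
    addOnLoad: function () {
      // Hand the page's dojo object back to the sandboxed script.
      onReady(win.dojo);
    }
  };

  // Add the <script> tag for dojo.js to the head of the document.
  var script = doc.createElement("script");
  script.type = "text/javascript";
  script.src = dojoUrl;
  doc.getElementsByTagName("head")[0].appendChild(script);
  return script;
}

// Inside a GreaseMonkey script the call would look something like:
//   loadDojo(unsafeWindow, document, dojoUrl, function (dojo) {
//     /* start the real work here, using the local `dojo` variable */
//   });
```

Taking dojo as the callback argument has the same effect as var dojo = unsafeWindow.dojo: everything inside the callback can refer to plain dojo.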
Screen Scraping With dojo.query
A lot of the features of DaftMonkey rely on asynchronously fetching remote HTML pages and scraping the required data from them. The approach I used for this was:
- Perform a remote request using GreaseMonkey’s native Ajax function GM_xmlhttpRequest. This works more or less the same as dojo.xhrGet, and I saw no reason to not use it.
- When the text is returned, create a DIV, and absolutely position it far to the left. Fix its size to just one pixel so it doesn’t mess with the scroll bars.
- Set the innerHTML of the DIV to the text you have retrieved. Congratulations, you can now use dojo.query to find whatever nodes you need. For example, to find all images inside anchor tags, use dojo.query("a img", tempDiv). Note the second parameter: this tells Dojo to only search inside the temporary DIV we created, and not the whole document.
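Put together, the scraping steps might look something like the sketch below. The helper names makeScratchDiv and scrape are made up for illustration, and the exact offsets and sizes are just one way of keeping the DIV out of sight. GM_xmlhttpRequest and dojo.query are the only real APIs used.

```javascript
// Hypothetical helper: builds the hidden scratch DIV and fills it with HTML.
function makeScratchDiv(doc, html) {
  var div = doc.createElement("div");
  div.style.position = "absolute";
  div.style.left = "-10000px";   // far off to the left of the page
  div.style.width = "1px";       // tiny, so the scroll bars are unaffected
  div.style.height = "1px";
  div.style.overflow = "hidden";
  div.innerHTML = html;
  doc.body.appendChild(div);
  return div;
}

// Hypothetical helper: fetch a page, then pass the matching nodes to a callback.
// `dojo` is the Dojo object copied out of unsafeWindow.
function scrape(url, selector, dojo, doc, callback) {
  GM_xmlhttpRequest({
    method: "GET",
    url: url,
    onload: function (response) {
      var tempDiv = makeScratchDiv(doc, response.responseText);
      // The second argument limits the search to the scratch DIV.
      callback(dojo.query(selector, tempDiv), tempDiv);
    }
  });
}
```

The callback also receives the scratch DIV, so the caller can remove it from the page once the data has been read.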
Some other site-specific things were required as part of the screen scraping process. Many of the sites had iframes included, and as soon as you add those to the temporary DIV, they start loading another page. This was a nasty performance hit, so I had to remove them from the HTML string before setting the innerHTML of the temporary DIV.
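One way to cut the iframes out of the HTML string is a pair of regular expressions, as in the sketch below. stripIframes is a hypothetical name rather than the function used in DaftMonkey, and regex-based HTML editing is only as reliable as the markup it meets, so treat this as a starting point.

```javascript
// Hypothetical helper: removes iframe elements from an HTML string
// before it is assigned to innerHTML.
function stripIframes(html) {
  return html
    // Paired tags: <iframe ...> ... </iframe>
    .replace(/<iframe\b[^>]*>[\s\S]*?<\/iframe\s*>/gi, "")
    // Anything left over: self-closing or unclosed <iframe ...> tags
    .replace(/<iframe\b[^>]*\/?>/gi, "");
}
```

The cleaned string can then be assigned to the temporary DIV's innerHTML without triggering any extra page loads from the iframes.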
Problems
One problem I found is that calling dojo.declare didn’t work from inside a GreaseMonkey script. I don’t know why. Therefore widgets had to be defined the old fashioned way.
A second problem was more related to the website I was writing the script for, Daft.ie. The entire site is laid out using TABLES! Seriously, there are barely one or two DIVs on the page, with practically no CSS either. This makes it quite difficult and brittle to screen scrape using dojo.query, as there are really no classes to match. Still, it was possible, but it could break relatively easily if the site layout is changed.
Get the Source
You can get the entire source for the script at http://userscripts.org/scripts/show/41105.
To read a bit more about DaftMonkey, I’ve put up a page about it at http://www.chofter.com/apps?n=daftmonkey.
