Jan 15 2014
 

The search team is doing crawling, and some web sites rely heavily on JavaScript to generate content. By “heavily” I mean that none of the UI elements come from the HTML; instead, JavaScript runs after the page has loaded and builds what is shown to users.

I’m building a prototype that they can use as a reference and later turn into something that fits their system better. The prototype is based on PhantomJS, which is in the Ubuntu (12.04 LTS) repository, and that makes my life much easier. Again, I need to install xvfb so that I can run X-based applications from the command line.

After everything is installed, simply create a JS file like this:

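// Load a page in PhantomJS and print an element that JavaScript generates.
var page = require('webpage').create();
// Send a desktop Chrome user agent string with the request.
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36';
page.open('https://www.google.com/webhp?hl=zh-CN&sourceid=cnhp#hl=zh-CN&q=%E5%BE%B7%E4%BC%AF%E5%AE%B6%E7%9A%84%E8%8B%94%E4%B8%9D&safe=strict', function(status) {
    if (status !== 'success') {
        console.log('Unable to access network');
        // Exit here too, otherwise PhantomJS keeps running after a failed load.
        phantom.exit(1);
    } else {
        // Give the page's scripts one second to build the DOM before reading it.
        window.setTimeout(function() {
            var dyn_content = page.evaluate(function() {
                // Runs inside the page; the element may not exist yet.
                var el = document.getElementById('rhs_block');
                return el ? el.innerHTML : null;
            });
            console.log(dyn_content);
            phantom.exit();
        }, 1000);
        // Printed right away, before the timer above fires.
        console.log('time again');
    }
});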

Then the element will be printed out on the screen …

I still need to figure out a better way to get rid of code like “setTimeout(…, 1000)”, as I believe the fixed delay slows down processing quite a lot, but anyway, it’s a prototype …
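One possible way to drop the fixed delay is to poll for the element and continue as soon as it shows up. The following is only a minimal sketch, not a tested part of the prototype: waitFor is a hypothetical helper (it is not part of the PhantomJS API), and the 5000 ms timeout and 100 ms polling interval are assumed values that would need tuning. It reuses the URL and the 'rhs_block' element id from the script above.

```javascript
// Poll `check` every `intervalMs` until it returns a truthy value, then call
// `onReady` with that value; call `onTimeout` if `timeoutMs` elapses first.
// NOTE: waitFor is a hypothetical helper, not something PhantomJS provides.
function waitFor(check, onReady, onTimeout, timeoutMs, intervalMs) {
    var start = Date.now();
    var timer = setInterval(function () {
        var result = check();
        if (result) {
            clearInterval(timer);
            onReady(result);
        } else if (Date.now() - start >= timeoutMs) {
            clearInterval(timer);
            onTimeout();
        }
    }, intervalMs);
}

// The part below only runs under PhantomJS; elsewhere only the helper is defined.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var url = 'https://www.google.com/webhp?hl=zh-CN&sourceid=cnhp#hl=zh-CN&q=%E5%BE%B7%E4%BC%AF%E5%AE%B6%E7%9A%84%E8%8B%94%E4%B8%9D&safe=strict';
    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
            phantom.exit(1);
            return;
        }
        waitFor(function () {
            // Returns the element's HTML, or null while the element is
            // missing. An empty innerHTML is falsy, so it counts as not ready.
            return page.evaluate(function () {
                var el = document.getElementById('rhs_block');
                return el ? el.innerHTML : null;
            });
        }, function (html) {
            console.log(html);
            phantom.exit();
        }, function () {
            console.log('Timed out waiting for rhs_block');
            phantom.exit(1);
        }, 5000, 100); // assumed values: give up after 5 s, poll every 100 ms
    });
}
```

With this shape, a page that renders quickly is handled after roughly one polling interval instead of a full second, and a page that never produces the element still terminates once the timeout is reached.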
