{"id":1616,"date":"2014-01-15T19:11:55","date_gmt":"2014-01-16T02:11:55","guid":{"rendered":"http:\/\/xiehang.com\/blog\/?p=1616"},"modified":"2014-01-28T11:02:25","modified_gmt":"2014-01-28T18:02:25","slug":"get-rendered-html","status":"publish","type":"post","link":"https:\/\/xiehang.com\/blog\/2014\/01\/15\/get-rendered-html\/","title":{"rendered":"Get rendered HTML"},"content":{"rendered":"

The search team is crawling web sites, and some of them rely heavily on JavaScript to generate content. By “heavily” I mean that none of the UI elements come from the HTML itself; instead, JavaScript runs after the page has loaded and builds what is shown to users.<\/p>\n

I’m building a prototype that they can use as a reference and later turn into something that fits their system better. The prototype is based on PhantomJS<\/a>, which is available in the Ubuntu (12.04 LTS) repository, and that makes my life much easier. Again, I need to install xvfb so that I can run an X-based application from the command line.<\/p>\n

After everything is installed, simply create a JS file like this:
\n
\nvar page = require('webpage').create();
\n// Identify as a desktop Chrome browser.
\npage.settings.userAgent = 'Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/31.0.1650.63 Safari\/537.36';
\npage.open('https:\/\/www.google.com\/webhp?hl=zh-CN&sourceid=cnhp#hl=zh-CN&q=%E5%BE%B7%E4%BC%AF%E5%AE%B6%E7%9A%84%E8%8B%94%E4%B8%9D&safe=strict', function(status) {
\n    if (status !== 'success') {
\n        console.log('Unable to access network');
\n        phantom.exit(1); // exit explicitly, otherwise PhantomJS keeps running
\n    } else {
\n        // Give the page's scripts one second to build the DOM before reading it.
\n        window.setTimeout(function() {
\n            var dyn_content = page.evaluate(function() {
\n                // Runs inside the page; return null if the element is not there yet.
\n                var el = document.getElementById('rhs_block');
\n                return el ? el.innerHTML : null;
\n            });
\n            console.log(dyn_content);
\n            phantom.exit();
\n        }, 1000);
\n    }
\n});
\n<\/code><\/p>\n

Running the script then prints the element’s inner HTML to the console.<\/p>\n
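
One possible way to avoid the fixed one-second delay in the script above is to poll for the element and continue as soon as it appears. The sketch below is only an illustration under assumptions: `waitFor`, its parameters, the placeholder URL, and the 10-second and 100 ms values are hypothetical and are not part of PhantomJS or the original prototype; only the `page.open`, `page.evaluate`, and `phantom.exit` calls and the `rhs_block` id are taken from the script above.<\/p>\n

```javascript
// Hypothetical helper, not part of PhantomJS: call condition() every
// intervalMs milliseconds; run onReady() once it returns true, or
// onTimeout() if timeoutMs milliseconds pass first.
function waitFor(condition, onReady, onTimeout, timeoutMs, intervalMs) {
    var start = Date.now();
    var timer = setInterval(function () {
        if (condition()) {
            clearInterval(timer);
            onReady();
        } else if (Date.now() - start >= timeoutMs) {
            clearInterval(timer);
            onTimeout();
        }
    }, intervalMs);
}

// Usage sketch; this part only runs under PhantomJS, where the global
// phantom object exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var url = 'http://example.com/'; // placeholder; substitute the page to crawl
    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
            phantom.exit(1);
            return;
        }
        waitFor(
            function () {
                // True once the page's scripts have created the element.
                return page.evaluate(function () {
                    return document.getElementById('rhs_block') !== null;
                });
            },
            function () {
                console.log(page.evaluate(function () {
                    return document.getElementById('rhs_block').innerHTML;
                }));
                phantom.exit();
            },
            function () {
                console.log('Timed out waiting for rhs_block');
                phantom.exit(1);
            },
            10000, // assumed upper bound: give up after 10 seconds
            100    // assumed polling interval: check every 100 ms
        );
    });
}
```

Compared with a fixed delay, this continues as soon as the element exists, and the timeout keeps a page that never renders it from hanging the crawler.<\/p>\n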

I still need to find a better way to get rid of code like “setTimeout(…, 1000)”, as I believe the fixed delay slows processing down quite a lot, but anyway, it’s a prototype …<\/p>\n","protected":false},"excerpt":{"rendered":"

The search team is crawling web sites, and some of them rely heavily on JavaScript to generate content. By “heavily” I mean that none of the UI elements come from the HTML itself; instead, JavaScript runs after the page has loaded and builds what is shown to users. I’m building a prototype that they can use as a reference and later […]<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[328,468],"_links":{"self":[{"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/posts\/1616"}],"collection":[{"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/comments?post=1616"}],"version-history":[{"count":2,"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/posts\/1616\/revisions"}],"predecessor-version":[{"id":1639,"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/posts\/1616\/revisions\/1639"}],"wp:attachment":[{"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/media?parent=1616"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/categories?post=1616"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/xiehang.com\/blog\/wp-json\/wp\/v2\/tags?post=1616"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}