Apr 24, 2012
 

While migrating my web sites from AWS US East to US West, I found it interesting to watch how the spiders behaved, both the good ones and the bad ones.

I set the TTL of my domain to 30 minutes, yet 12 hours later some spiders were still NOT going to my new sites:

  • Baidu – Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
  • Netease – Mozilla/5.0 (compatible;YoudaoFeedFetcher/1.0;http://www.youdao.com/help/reader/faq/topic006/;1 subscribers;)
  • Yandex – Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
  • Tencent – Sosospider+(+http://help.soso.com/webspider.htm)

Among them, Baidu and Netease were crawling really aggressively for whatever reason … am I updating my blog too frequently? 😀

And here is the good list:

  • Google – Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  • Google – Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feed-id=13667814432325378539)
  • Bing – Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
  • What are these???
    • Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.19; aggregator:Spinn3r (Spinn3r 3.1); http://spinn3r.com/robot) Gecko/2010040121 Firefox/3.0.19
    • Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Speedy Spider (http://www.entireweb.com/about/search_tech/speedy_spider/)
    • Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@gmail.com)
    • BlogSearch/2 +http://www.icerocket.com/
    • magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)
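
Lists like the two above can be put together by tallying User-Agent strings from the access logs on the old and new boxes. Below is a minimal sketch, assuming nginx writes its default "combined" log format (where the User-Agent is the last quoted field on each line). The `SPIDERS` list and the `count_spiders` helper are illustrative names, not something from the actual setup, and the log path is whatever your server uses.

```python
import re
import sys
from collections import Counter

# Assumption: nginx "combined" log format, where the User-Agent
# is the last double-quoted field on each line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

# Illustrative substrings taken from the User-Agent strings listed above.
SPIDERS = [
    "Baiduspider", "YoudaoFeedFetcher", "YandexBot", "Sosospider",
    "Googlebot", "Feedfetcher-Google", "bingbot",
]


def count_spiders(lines):
    """Count hits per known spider; everything else is grouped as 'other'."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # skip lines that do not end with a quoted field
        agent = match.group(1)
        name = next((s for s in SPIDERS if s in agent), "other")
        counts[name] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_spiders.py /path/to/access.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for name, hits in count_spiders(log).most_common():
            print(f"{hits:8d}  {name}")
```

Running it against the old box's log after the TTL has expired should show which spiders are still holding on to the stale IP address; running it on the new box shows who has already moved over.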

Oddly, I didn’t see Sohu or Yahoo in the logs; maybe Yahoo has stopped crawling (replacing its homegrown technology with Microsoft’s Bing?). Also, Tencent seems to have a deployment issue: it is crawling both the new and the old site, as if its crawlers were not in sync.

I will watch for another day and then shut down the old box.

  One Response to “Spiders – good and bad”

  1. I’ve shut down nginx, mysql, and pptpd, so the old box (in US East) has become a zombie; I will cut it off tomorrow.

    Baidu and Netease were still crawling, #@$%^&*()(&^%*^
