Jul 06 2014
 

I’m working for the big chip company now, sounds a little bit weird, though 😉 .

Anyway, I’m sitting in the backend team focusing on DevOps; I guess I’ll be in this mixed role till we find a dedicated DevOps guy. By the time that happens, I should have finished the monitoring facility (plus logistics like the on-call schedule, etc.), and also the plan to migrate from Rackspace to AWS.

Everything works smoothly so far except git – I admit the current company is using git in a modern way, but I don’t think the previous company was doing anything wrong. Anyway, people do have different ideas of how to use git; I just have to fit into the company’s style.

I don’t quite like saltstack although I’m still trying to get familiar with it. However, before I raise this as a concern to the team, I’d like to make sure everything that saltstack is doing can be done by cfengine.

Ah yea, also need to evaluate Shinken as it’s a pure Python solution, and “we are a python house”.

Aug 28 2013
 

Per the history, this problem had been there for several months – it seems nagiosgraph’s RRD data stopped updating at some point, and it seems this was related to an upgrade.

I never paid too much attention to this, as nobody else except me watches those performance data graphs. However, today I did need to solve it, as it’s always bothersome that you cannot get data whenever you do need it.

Long story short – and stupid me – obviously a while back, when I applied my patch to nagios.cfg, I didn’t comment out the first process_performance_data=0, and so the second process_performance_data=1 didn’t overwrite the first one. Also, even with process_performance_data=0, which disables performance data processing, Nagios still processed performance data every time it got restarted, and this caused all the mysterious problems.
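For reference, the broken nagios.cfg looked roughly like this (a reconstruction from memory, not the exact file):

```ini
# The stock setting, which I forgot to comment out:
process_performance_data=0

# ...many lines later, added by my patch. Nagios keeps the
# first value it sees, so this one was silently ignored:
process_performance_data=1
```

The fix was simply to comment out (or delete) that first line so only process_performance_data=1 remains.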

I do believe Nagios behaves wrongly in this case – it’s common sense that a configuration value can be overwritten by a later one with the same key, and if the configuration tells you not to do something, DON’T do it, even in edge cases; otherwise people get confused and debugging becomes really difficult.

Aug 01 2012
 

Be aware this is a joke. 😀

I kept getting an alert from a memcached service saying the hit rate was too low. After checking it, I found it was actually because the traffic was too low (the site was not in production yet). So how to solve this problem?

After thinking of various “serious” solutions, I finally settled on running a script that keeps hitting the entries (actually, only one specific entry) in the cache, raising the hit rate to a level that makes Nagios stop alerting me.
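The “cheat” script looks roughly like this – a toy sketch using the raw memcached text protocol, assuming memcached on localhost:11211 and a made-up key name (the key must already exist in the cache, otherwise you’d only be inflating the miss rate instead):

```python
import socket
import time


def build_get(key: str) -> bytes:
    """Build a memcached text-protocol `get` command for one key."""
    return b"get " + key.encode("ascii") + b"\r\n"


def hammer_cache(host: str = "127.0.0.1", port: int = 11211,
                 key: str = "some_hot_key", interval: float = 1.0) -> None:
    """Repeatedly GET one existing key so the hit/miss ratio climbs."""
    with socket.create_connection((host, port)) as sock:
        while True:
            sock.sendall(build_get(key))
            sock.recv(4096)  # ignore the reply; we only want the hit counted
            time.sleep(interval)


# hammer_cache()  # uncomment to run against a live memcached
```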

OK, OK, I know this is bad, but fun, isn’t it?

Actually this prompts another thought – if there is a particular key with a high miss rate, do we care? If we do, then the monitoring mechanism won’t work for that case …

Jul 31 2012
 

Trying to set up Nagios to play with monitoring facilities, it turned out there are way too many things that do NOT run out of the box. I’m trying to write down as much as I can remember, so that I don’t have to Google again the next time I step into the setup task. Sure, others may benefit from this as well.

A brief intro about the environment – I have my monitoring node in EC2 on the east coast, another 3 servers to be monitored in EC2 on the west coast, all four running Ubuntu 12.04, plus another physical box sitting in an IDC in Beijing, China, running Fedora 14 (the owner does not want to upgrade for some reason). Almost all servers are running classic Web applications, such as Nginx, MySQL, etc. Other than those public services I also need to monitor system status like disk space, memory utilization, ssh liveness, etc.
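To give a flavor of the configuration, the object definitions for one of the monitored hosts end up looking roughly like this (host name and address are made up; check_nrpe assumes the NRPE add-on on the remote host for local checks like disk space):

```ini
define host {
    use        generic-host
    host_name  web-west-1
    address    203.0.113.10     ; hypothetical EC2 west-coast IP
}

define service {
    use                  generic-service
    host_name            web-west-1
    service_description  SSH
    check_command        check_ssh
}

define service {
    use                  generic-service
    host_name            web-west-1
    service_description  Disk Space
    check_command        check_nrpe!check_disk
}
```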