Aug 22, 2013
 

I’m working on a set of data refresh scripts, which read data from files, do some transformation, then send the results to an HTTP interface. Since the HTTP interface is rather slow compared with reading and transforming the data, I fork several child processes to handle the HTTP part.
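
The shape of the pipeline is roughly the sketch below. It is a minimal sketch rather than the real script; the endpoint URL, worker count, and the transform() helper are all made-up placeholders.

```python
# Rough sketch of the pipeline: the parent reads and transforms records,
# forked worker processes drain a queue and talk to the slow HTTP interface.
# ENDPOINT, WORKERS, transform() and the input format are placeholders.
import json
import urllib.request
from multiprocessing import Process, Queue

ENDPOINT = "http://example.com/refresh"   # hypothetical HTTP interface
WORKERS = 4                               # number of forked children

def transform(line):
    # placeholder for the real transformation step
    return {"payload": line.strip()}

def worker(queue):
    while True:
        record = queue.get()
        if record is None:                # sentinel: no more work
            break
        req = urllib.request.Request(
            ENDPOINT,
            data=json.dumps(record).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req).read()

def main(path):
    queue = Queue(1000)                   # bounded so the reader can't run away
    children = [Process(target=worker, args=(queue,)) for _ in range(WORKERS)]
    for child in children:
        child.start()
    with open(path) as fh:
        for line in fh:                   # reading + transforming is the fast part
            queue.put(transform(line))
    for _ in children:
        queue.put(None)
    for child in children:
        child.join()

if __name__ == "__main__":
    main("input.txt")
```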

Everything was written in Perl about 6 months ago, and everything seemed fine … until I started picking up Python. The first thing I noticed is that the Python programs are about 50% of the Perl ones in terms of LOC, which makes them easier to read, though honestly I don’t care much about that since the logic is quite simple. However, when I tested the Python programs and found they are at least 50% faster than the Perl ones, I got nervous.

Two examples – in one, Perl takes 13 seconds and Python takes 5; in the other, Perl takes 34 minutes and Python takes 10. Actually I’m really nervous at this point given my poor Python skills; I keep worrying that I got something wrong in the translation (from Perl to Python), even though I have verified the result data quite a few times.

Will dig in after converting all scripts to Python.

Jul 22, 2013
 

Playing with GlusterFS now, here’s the to-do list:

  1. Installation and basic configuration, plus getting familiar with command line utilities
  2. Set up a RAID-10-like configuration, with geo-replication if possible
  3. Regular routine maintenance; I haven’t got a clear idea yet, but it should include: expanding a volume, shrinking a volume, replacing a brick in a volume, rebalancing data, converting another filesystem to GlusterFS, recovering from various disasters, etc.
  4. Performance testing under various scenarios, even though people already have some numbers for these, including: a mail server with maildir (a large number of small files with a small amount of concurrent access), a file server (a medium number of medium-sized files with almost no concurrent access), a video server (a small number of large files with a large amount of concurrent access), file-based databases like SQLite and BerkeleyDB (concurrent access with lots of seek operations), and RDBMSes like MySQL/PostgreSQL (a rough sketch for the small-file case follows this list)
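
For the maildir-like case in item 4, the test I have in mind looks roughly like the sketch below. The mount point, file count, and file size are arbitrary assumptions, not tuned values.

```python
# Rough small-file benchmark for the maildir-like scenario: write then re-read
# many tiny files on a GlusterFS mount and report files/second.
# MOUNT, COUNT and SIZE are arbitrary placeholders.
import os
import time

MOUNT = "/mnt/gluster/bench"   # hypothetical GlusterFS mount point
COUNT = 10000                  # number of small files
SIZE = 4096                    # bytes per file, roughly one small mail

def bench():
    os.makedirs(MOUNT, exist_ok=True)
    payload = b"x" * SIZE

    start = time.time()
    for i in range(COUNT):
        with open(os.path.join(MOUNT, "msg%06d" % i), "wb") as fh:
            fh.write(payload)
    print("write: %.0f files/s" % (COUNT / (time.time() - start)))

    start = time.time()
    for i in range(COUNT):
        with open(os.path.join(MOUNT, "msg%06d" % i), "rb") as fh:
            fh.read()
    print("read:  %.0f files/s" % (COUNT / (time.time() - start)))

if __name__ == "__main__":
    bench()
```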

Sounds like a great plan, right :D? Let’s see.

May 21, 2012
 

I was running a crawler-like application on my EC2 nodes, which grabs a set of web pages, saves them locally, then does some follow-up processing; obviously I hit the bottleneck of the EC2 nodes.

It seems EC2’s storage, at least EBS, performs quite badly. I don’t have numbers to compare, but for the follow-up jobs, which have nothing to do with the network, EC2 is hundreds or thousands of times slower than a regular modern machine.

I guess traditional applications on EC2 don’t rely on EBS too much – most data is written to S3/RDS/SDB, etc. However, I do have to pay more attention to this, as a friend of mine is about to run some critical jobs on EC2 on my recommendation … I will do some more tests and post results here if I find anything significant.
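
The first test I have in mind is a dumb random-read timing like the sketch below, run once against a file on EBS and once against a local disk. The file path and sizes are placeholders, and the numbers only mean something if the file is bigger than the page cache (or the cache is dropped first).

```python
# Dumb random-read timing: run it once on a file living on EBS and once on a
# local disk, then compare. FILE, FILE_SIZE and READS are placeholders.
# Results are only meaningful if FILE_SIZE exceeds free memory (page cache).
import os
import random
import time

FILE = "/data/testfile"        # point this at EBS or at a local disk
FILE_SIZE = 1 << 30            # 1 GiB test file
BLOCK = 4096
READS = 5000

def prepare():
    # fill the file with real data so we don't just read sparse holes
    chunk = os.urandom(1 << 20)
    with open(FILE, "wb") as fh:
        for _ in range(FILE_SIZE // len(chunk)):
            fh.write(chunk)

def random_reads():
    fd = os.open(FILE, os.O_RDONLY)
    start = time.time()
    for _ in range(READS):
        os.lseek(fd, random.randrange(0, FILE_SIZE - BLOCK), os.SEEK_SET)
        os.read(fd, BLOCK)
    os.close(fd)
    elapsed = time.time() - start
    print("%.0f reads/s, %.2f ms avg" % (READS / elapsed, elapsed * 1000 / READS))

if __name__ == "__main__":
    if not os.path.exists(FILE):
        prepare()
    random_reads()
```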

Apr 17, 2012
 

I’m talking about performance here: for the same program with the same compiler options, gcc on CentOS 6 generates a much better executable than gcc on CentOS 5; the gap is 10~15%.

I think gcc makes most of the difference, as changing glibc etc. does not change much. It seems gcc did a pretty good job going from 4.1 to 4.4; I will dig into the details whenever I have time.

Apr 17, 2012
 

I’m a strong supporter of the STL, since ideally you can get rid of new/delete and malloc/free and stop worrying about memory leaks; sure, you still need to think about resource leaks, but those are much easier for me to debug.

However, now I’m stuck on a performance issue with the STL. I know the STL does a lot of copy operations, but I never imagined the performance would be this bad. I think the main reason is that I use too many std::string objects and all that copying takes time, but what else can I do – move to char *? That would bring back new/delete again.

I’m running oprofile now to see if there is anything else to be optimized – the goal is 400K bytes/second, and I’m at 310K, while the old pointer-rich implementation was no less than 500K. If there is nothing left to tune, I will just drop this branch and keep the previous ugly but efficient code.

Jan 19, 2010
 

I set up a testing environment on a couple of company boxes to see how Cassandra performs on real machines (real here means powerful enough to be a data node); here are the details of the environment:

  • Two client nodes, one server node, all running RHEL 4.x. I use two client nodes because during the performance tests I found that a single client machine cannot generate enough load
  • All three machines have 8 cores/16G memory (well, memory is not a big deal for my tests)
  • Running Cassandra 0.5.0 RC3 (built from svn last night)
  • Clients are written in Python (the rough shape of the load loop is sketched right after this list)
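
The client-side load loop is roughly shaped like the sketch below. lookup_user() stands in for the real Thrift get() call against the 0.5.0 server, and the thread count and key space are placeholders.

```python
# Rough shape of the client load generator: N threads hammer the server with
# single-key lookups while we track QPS and average latency.
# lookup_user() is a placeholder for the actual Thrift get() call.
import random
import threading
import time

THREADS = 16            # active clients per box (placeholder)
DURATION = 60           # seconds per run
counters = {"requests": 0, "latency": 0.0}
lock = threading.Lock()

def lookup_user(key):
    # placeholder: in the real test this is one single-key read from Cassandra
    pass

def run():
    deadline = time.time() + DURATION
    while time.time() < deadline:
        key = "user%08d" % random.randrange(1000000)
        start = time.time()
        lookup_user(key)
        elapsed = time.time() - start
        with lock:
            counters["requests"] += 1
            counters["latency"] += elapsed

threads = [threading.Thread(target=run) for _ in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("QPS: %.0f" % (counters["requests"] / DURATION))
print("avg latency: %.1f ms" % (counters["latency"] * 1000 / max(counters["requests"], 1)))
```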

Here is the graph for simple request (single key lookup):

The result seems pretty encouraging – queries per second on the server grow almost linearly, and at about 5,000 QPS, overall CPU utilization is still under 40% (25% user, 12% sys). I cannot get more client boxes to test with, but if it keeps scaling this way, and we take 80% as the CPU utilization threshold, then this kind of box can handle roughly 10K QPS, with latency at around 3ms.

Note that CPU utilization, QPS per client, and latency are hard to read here because the overall QPS is so high, but you can get some idea from the next graph …

Here is the graph for the application workload (login, which does one user lookup and then 10~100 more user lookups, each one fetching a buddy’s information):

This result worries me a bit, since CPU utilization is already at 70% (45% user and 25% sys); it seems 200 QPS is all the cluster can provide. However, considering that the login operation does far too many lookups (55 per login on average), 200 logins/s works out to about 11,000 lookups/s, which roughly matches the simple-lookup figure above (10K QPS per box), while latency is at around 80ms.

Actually, 20% sys is pretty bad; it means the kernel is busy context switching (I didn’t check vmstat at the time, but this is a reasonable guess). Then again, it may be reasonable since the machine is handling 16 active clients sending a bunch of requests while it has only 8 physical cores, so context switching is unavoidable.

Since everything scales linearly, I can assume a 4-core box can offer 5,000 QPS with reasonable latency. I will do some similar tests with MySQL and memcached, and I will also run a similar test with multiple data nodes, since I have the impression that multiple data nodes are far slower than a single node (inter-node communication?).

Dec 09, 2009
 

It seems the current design cannot get around the bottleneck caused by the relationship data; I think I need to rethink the design.

It is said that doing denormalization at write time is a good approach, as these NoSQL data stores are really fast on writes. However, it may be too much work to do. In my current scenario every user has 10~100 buddies, and if I do the denormalization at write time, it is still not clear to me how to set up a schema to fit it (a rough sketch of one option follows below).
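
For reference, the kind of write-time denormalization I keep going back and forth on looks roughly like this hypothetical sketch. The column family names, the in-memory store, and the fan-out helper are all made up, not something I have actually tested.

```python
# Hypothetical write-time denormalization for the buddy-list problem: every
# user's row carries a duplicated snippet of each buddy, so a login becomes a
# single row read instead of 10~100 follow-up lookups. The in-memory "Store"
# below only mimics a column-family style get/insert; all names are made up.

class Store:
    def __init__(self):
        self.rows = {}                      # (column_family, key) -> {column: value}

    def insert(self, cf, key, columns):
        self.rows.setdefault((cf, key), {}).update(columns)

    def get(self, cf, key):
        return self.rows.get((cf, key), {})

def on_profile_update(store, user_id, profile, follower_ids):
    # write-time fan-out: copy the profile into every follower's buddy row
    store.insert("Users", user_id, {"profile": profile})
    for fid in follower_ids:
        store.insert("BuddyLists", fid, {"buddy:%s" % user_id: profile})

def on_login(store, user_id):
    # read side: one row fetch returns all duplicated buddy data
    return store.get("BuddyLists", user_id)

if __name__ == "__main__":
    s = Store()
    on_profile_update(s, "alice", "Alice|online", ["bob", "carol"])
    on_profile_update(s, "bob", "Bob|away", ["alice"])
    print(on_login(s, "alice"))             # {'buddy:bob': 'Bob|away'}
```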

I find I’m still sort of stuck in the RDBMS world; I need to jump into this NoSQL universe as soon as possible :P.