Oct 30 2009
 

I just upgraded the following machines to Ubuntu 9.10:

  • Debian 5
  • Ubuntu 8.04 LTS
  • Ubuntu 9.04

So now, for all Debian-like distros, I'm running pure Ubuntu.

Fedora 12 will be released in mid-November; I will "upgrade" both my CentOS and Fedora 11 machines to it, and after that I will keep running Ubuntu and Fedora as my Linux platforms, 50-50.

Oct 28 2009
 

It seems the PowerBook is the last Unix system where I don't have the right uid for myself, but obviously changing a uid there is totally different from doing it on Linux/FreeBSD.

Here is the source; it says to use dscl, and I found dscl can do almost all user/group management, and probably more, but I don't need more :P.
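
For my own notes, something along these lines should do it. This is an untested sketch – the username and uids below are made up, and the trailing chown is there because files on disk keep the old numeric owner:

    #!/usr/bin/perl
    # sketch: change my uid on the PowerBook via dscl (adjust names/uids first)
    use strict;
    use warnings;

    my ($user, $old_uid, $new_uid) = ('me', 501, 1001);   # made-up values

    # dscl handles the directory-service side (what usermod does on Linux)
    system('sudo', 'dscl', '.', '-change', "/Users/$user",
           'UniqueID', $old_uid, $new_uid) == 0
        or die "dscl failed: $?\n";

    # re-own the home dir; 'staff' is the usual primary group, adjust if needed
    system('sudo', 'chown', '-R', "$new_uid:staff", "/Users/$user") == 0
        or die "chown failed: $?\n";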

Oct 28 2009
 

I believe I will need to re-install a bunch of dev/testing nodes at home in the near future, so I need an online backup service – preferably free (I only need it for ~2 weeks), with ~10G of space and rsync support.

I will dig around to see if I have any luck; what I found earlier offers 2G for free, but with no rsync support.

Oct 28 2009
 

I don't think I will have to face hashing algorithms directly, as I expect the libraries/servers I'm going to use will provide the best one from their perspective. However, out of curiosity about how things work, and to practice consistent hashing, I wrote a simple Perl script.

The result is pretty interesting. I know I cannot cover all test cases, but from various tests I learned:

  • FNV is really a bad candidate for hashing … fnv-1 is worse than fnv-1a
  • CRC32 is not as good as expected
  • SHA1 provides acceptable results
  • MD5 does pretty well

Performance-wise, CRC32 is the winner; MD5 is ~20% slower and SHA1 is ~33% slower. I cannot judge FNV's performance fairly, as mine is a pure-Perl implementation while the others are Perl modules backed by C/C++ code.
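
The script itself is nothing fancy. Roughly, the distribution side of it looks like the sketch below – not the exact script I ran, just the approach; the module choices, the 100,000 synthetic keys, and the 8 buckets are assumptions for illustration:

    #!/usr/bin/perl
    # hash a pile of keys with each algorithm, bucket them mod N,
    # and see how even the buckets come out
    use strict;
    use warnings;
    use Digest::MD5 qw(md5);
    use Digest::SHA qw(sha1);
    use String::CRC32;

    my $buckets = 8;
    my @keys    = map { "key:$_" } 1 .. 100_000;

    # 32-bit FNV-1a, pure Perl (assumes a 64-bit perl so the shifts don't overflow)
    sub fnv1a_32 {
        my $h = 0x811c9dc5;
        for my $c (unpack 'C*', shift) {
            $h ^= $c;
            # multiply by the FNV prime 0x01000193 via shifts
            $h += ($h << 1) + ($h << 4) + ($h << 7) + ($h << 8) + ($h << 24);
            $h &= 0xffffffff;
        }
        return $h;
    }

    my %hashers = (
        'fnv-1a' => \&fnv1a_32,
        'crc32'  => \&crc32,
        'md5'    => sub { unpack 'N', md5($_[0]) },    # first 32 bits only
        'sha1'   => sub { unpack 'N', sha1($_[0]) },   # first 32 bits only
    );

    for my $name (sort keys %hashers) {
        my @count = (0) x $buckets;
        $count[ $hashers{$name}->($_) % $buckets ]++ for @keys;
        printf "%-7s %s\n", $name,
            join ' ', map { sprintf '%.1f%%', 100 * $_ / @keys } @count;
    }

For the timing side, wrapping the same calls in a tight loop and comparing wall-clock time is enough to see the CRC32/MD5/SHA1 gap.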

This could be my last "research" on consistent hashing; I'm going to move on to deployment and try out all those NoSQL solutions.

Oct 26 2009
 

Here is the list I composed during the weekend:

I'm not sure if I should try out these projects; the major concern is that I don't have confidence in the grid-style ones (Hadoop, etc.) when it comes to latency:

and I won't consider these because they rely on specific providers:

  • BigTable
  • Dynamo
  • [Could be more …]

Java and Erlang are pretty popular here – I don't like one of them, and I don't know much about the other, though.

http://memcachedb.org/
Oct 25 2009
 

I'm building NoSQL stuff pretty much from source (instead of from packaged builds), so I have some restrictions on library versions, such as libevent, Berkeley DB, etc.

Obviously CentOS is a bad choice for this, as its goal is to maintain stability, similar to Ubuntu's LTS releases. Surprisingly, Debian's repo is not that up-to-date either (for example, db4.6 only, not 4.7).

So I think I should stick with Fedora and Ubuntu (non-LTS), which seem to keep up with cutting-edge stuff. I will also check other distros – I guess as long as they maintain a 6-month release cycle, they will be "up-to-date".

I will get rid of CentOS and Debian for now, and if I cannot find any other distro that meets the requirement, I will make all my testing nodes run Fedora and Ubuntu, half and half.

Oct 23 2009
 

I've read too many articles talking about this NoSQL stuff; now I have to have a plan to proceed (with what? :-W).

First of all, I'm going to remove the MySQL and OpenLDAP installations in my testing environment :P. MySQL feels kind of slow to me, though I have quite a bit of experience setting it up, including replication, etc., and I will check whether all my applications can be built on a key-value data store (see below). OpenLDAP is another story: I still haven't figured out how to set it up with replication – the last time I tried was 2.2/2.3, but 2.4 introduces a whole new approach to replication, and I think I'm going to leave it untouched for now. Note that I still need to come back to LDAP later, since it is still the perfect solution for running corporate-like applications, such as what I did a couple of times before – integrating mail, IM, wiki, blog, etc. together under a single user id.

OK, back to NoSQL; a couple of things to do:

  1. Consistent hashing. I still need to read all those articles and try out different implementations. I don't think I will write my own, but I need something that works on Linux and Windows (OS X? Don't think so) and supports some major programming languages (C/C++, PHP/Python, Java). I also need to run tests similar to what I've already done and understand how it affects deployment.
  2. Try out different engines. Most likely I won't try anything too fancy (read: "complicated"); for example, I will prefer Redis over memcachedb just because memcachedb's replication is not that simple to me – I believe anything complicated to set up will be a headache to maintain. I will also skip the so-called document stores/graph stores unless they support a simple key-value store at the same performance (then those features become a nice add-on). I don't have the list yet, but I will put one together over the coming weekend. Things to be tested include installation, replication, fail-over, backup and recovery, monitoring, etc. Supported programming languages will be another important factor; I want a list similar to item #1.
  3. Applications. I'm going to collect the "traditional" web features that involve a data store and check how to implement them on a distributed key-value store (see the toy sketch right after this list). For example, user registration, login, and editing preferences/profile is one of the fundamental feature sets, and buddy-related operations (add as buddy, blacklist, check online status, notify a buddy of an event / be notified by a buddy) is another. Things currently on my mind include messaging features (internal/external IM/mail), posting features (threaded posts like a forum; votes/surveys may be in this category as well), and maybe some search features. I don't think I can come up with a full list in the coming days, but I will keep posting here.
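
Here is that toy sketch for item #3 – registration/login/buddy handling against a bare key-value interface. A plain Perl hash stands in for the real store, and the key naming scheme, field names, and sha1 password hashing are just my assumptions:

    #!/usr/bin/perl
    # toy model: web-ish features on nothing but get/set by key
    use strict;
    use warnings;
    use Digest::SHA qw(sha1_hex);

    my %store;   # stand-in for the real key-value store

    sub register {
        my ($name, $password) = @_;
        return 0 if exists $store{"user:name:$name"};       # name already taken
        my $uid = ++$store{'seq:uid'};                       # naive id allocator
        $store{"user:name:$name"}   = $uid;                  # name -> uid lookup
        $store{"user:$uid:pass"}    = sha1_hex($password);   # never store plain text
        $store{"user:$uid:profile"} = "nick=$name";          # profile blob
        return $uid;
    }

    sub login {
        my ($name, $password) = @_;
        my $uid = $store{"user:name:$name"} or return 0;
        return $store{"user:$uid:pass"} eq sha1_hex($password) ? $uid : 0;
    }

    sub add_buddy {
        my ($uid, $buddy) = @_;
        # buddy list kept as one comma-separated value; a real store would
        # want a list/set type or compare-and-swap to avoid lost updates
        my @buddies = split /,/, ($store{"user:$uid:buddies"} || '');
        push @buddies, $buddy unless grep { $_ == $buddy } @buddies;
        $store{"user:$uid:buddies"} = join ',', @buddies;
    }

    my $alice = register('alice', 'secret') or die "name taken\n";
    my $bob   = register('bob',   'hunter2');
    add_buddy($alice, $bob);
    print login('alice', 'secret') ? "login ok\n" : "login failed\n";

The point is just that everything reduces to a handful of key lookups; whether that holds up under replication and fail-over is exactly what item #2 is supposed to tell me.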

This is pretty much what's on my mind. All this stuff seems to be new and almost none of it is well packaged, so after 4~5 years of using yum/apt, I now need to do what I used to do – build everything from scratch. If I have time, I will compose some packages to ease my deployment.

Oct 23 2009
 

Most NoSQL solutions are a kind of cache backed by a persistent data store, with or without replication support. One of the key issues in a production environment is using consistent hashing to limit the damage when cache nodes are added or fail.

I talked to a friend a few days ago about a memcached deployment problem. He asked what to do when adding a new memcached node to expand capacity, so as to avoid re-loading a pile of data from the database into the cache nodes. I told him I don't have any experience with that, but if I ran into this problem I would try restarting the memcached client machines one by one to pick up the new configuration, to avoid putting a massive load on the database; I would also think about changing the memcached client's hashing function, to maximize the number of entries that keep their partition unchanged.
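
To see why the naive approach hurts: with plain hash(key) % N, almost every key lands on a different node once N changes. A quick back-of-the-envelope check (key names made up):

    # going from 4 to 5 nodes with plain modulo hashing -- most keys move
    use strict;
    use warnings;
    use String::CRC32;

    my @keys  = map { "key:$_" } 1 .. 100_000;
    my $moved = grep { crc32($_) % 4 != crc32($_) % 5 } @keys;
    printf "%.1f%% of keys changed node\n", 100 * $moved / @keys;   # roughly 80%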

It turned out my second idea was the right one (I should have read all those articles before talking to him :P). There are a couple of articles discussing this issue, and the good starting point, of course, is Wikipedia.

I tried libketama, and it seems pretty good in terms of retention rate. I ran some tests that could be (sort of) real-world use cases. Say we have 4 weak (512M) nodes and want to replace them with new nodes of double the capacity (1G); I add the new nodes to the cluster one by one, then remove the old nodes one by one. Here is what I got:

cluster          capacity   capacity changed   keys moved
4x512M           2G          0%                 0%
4x512M + 1x1G    3G         50%                40%
4x512M + 2x1G    4G         33%                30%
4x512M + 3x1G    5G         25%                25%
4x512M + 4x1G    6G         20%                20%
3x512M + 4x1G    5.5G        8%                12%
2x512M + 4x1G    5G          9%                13%
1x512M + 4x1G    4.5G       10%                18%
4x1G             4G         11%                19%

Relatively speaking, the percentage of keys that moved to other partitions is close to the capacity change, which means it is close to the best possible number.
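
For reference, the test boils down to something like the sketch below. This is my own stripped-down ketama-style ring, not libketama itself; the node names, the weights (1 for 512M, 2 for 1G), and the 160-points-per-weight figure are assumptions for the sketch:

    #!/usr/bin/perl
    # build a weighted hash ring, then count how many keys land on a
    # different node after the cluster changes
    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    sub build_ring {
        my (%nodes) = @_;                      # node name => weight
        my %ring;
        for my $node (keys %nodes) {
            for my $i (1 .. 160 * $nodes{$node}) {
                my $point = unpack 'N', md5("$node-$i");
                $ring{$point} = $node;         # many virtual points per node
            }
        }
        my @points = sort { $a <=> $b } keys %ring;
        return { ring => \%ring, points => \@points };
    }

    sub node_for {
        my ($r, $key) = @_;
        my $h = unpack 'N', md5($key);
        for my $p (@{ $r->{points} }) {        # linear scan; real code would bsearch
            return $r->{ring}{$p} if $p >= $h;
        }
        return $r->{ring}{ $r->{points}[0] };  # wrap around the ring
    }

    # 4 x 512M, then add one 1G node and see how many keys move
    my $old = build_ring(map { ("node$_" => 1) } 1 .. 4);
    my $new = build_ring((map { ("node$_" => 1) } 1 .. 4), 'node5' => 2);

    my @keys  = map { "key:$_" } 1 .. 10_000;
    my $moved = grep { node_for($old, $_) ne node_for($new, $_) } @keys;
    printf "%.1f%% of keys moved\n", 100 * $moved / @keys;

(In the ideal case, only the keys owned by the newly added node have to move at all.)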

And the key distribution is pretty even (each cell is capacity% / utilization%; the 512M nodes are listed first, then the 1G nodes):

cluster          512M nodes (capacity% / utilization%)           1G nodes (capacity% / utilization%)
4x512M           25.0/25.6  25.0/21.7  25.0/24.7  25.0/28.0      -
4x512M + 1x1G    16.7/16.9  16.7/15.2  16.7/19.0  16.7/17.7      33.3/31.1
4x512M + 2x1G    12.5/13.5  12.5/10.8  12.5/13.7  12.5/12.7      25.0/24.5  25.0/24.8
4x512M + 3x1G    10.0/10.9  10.0/ 9.4  10.0/11.0  10.0/ 8.3      20.0/19.6  20.0/20.0  20.0/20.9
4x512M + 4x1G     8.3/ 8.9   8.3/ 8.3   8.3/ 8.1   8.3/ 7.0      16.7/16.7  16.7/17.1  16.7/17.9  16.7/16.1
3x512M + 4x1G     9.1/ 9.0   9.1/ 9.6   9.1/ 8.2                 18.2/17.5  18.2/18.3  18.2/19.8  18.2/17.6
2x512M + 4x1G    10.0/ 9.7  10.0/ 8.9                            20.0/20.3  20.0/20.5  20.0/21.9  20.0/18.6
1x512M + 4x1G    11.1/ 9.2                                       22.2/22.3  22.2/22.2  22.2/25.2  22.2/21.1
4x1G             -                                               25.0/24.2  25.0/24.5  25.0/27.2  25.0/24.1

I still need to try out FNV to see if it gives a better distribution and/or less key movement; the article above says it at least has better performance.