Oct 26 2012

Here is the to-do list for the mail/IM setup:

  1. IM messages are archived to MySQL, but each session still needs to be composed into a mail and dropped into a mailbox to make the backup more reliable (the mail will be automatically copied to a Gmail account)
  2. Mail alerts are not done. At this stage I need to determine the right XMPP message type to use; logic-wise, the design is done
  3. Need a quick design for mail search
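
As a rough illustration of item 1, here is a minimal Python sketch of turning one archived IM session into a single mail message. Everything in it is an assumption for illustration: the function name `session_to_mail`, the `(timestamp, sender, body)` tuple layout, and the addresses are hypothetical, and reading the rows out of MySQL is left out.

```python
from datetime import datetime
from email.message import EmailMessage


def session_to_mail(messages, owner, peer):
    """Render one archived IM session as a single mail message.

    `messages` is a list of (timestamp, sender, body) tuples, assumed to be
    already fetched from the archive and sorted by time (hypothetical layout).
    """
    msg = EmailMessage()
    # The mail is addressed to the owner's own mailbox, which is where
    # the automatic copy to the backup account would pick it up.
    msg["From"] = owner
    msg["To"] = owner
    start = messages[0][0]
    msg["Subject"] = f"IM session with {peer} on {start:%Y-%m-%d %H:%M}"
    # One line per IM message, in chronological order.
    lines = [f"[{ts:%H:%M:%S}] {sender}: {body}" for ts, sender, body in messages]
    msg.set_content("\n".join(lines))
    return msg


if __name__ == "__main__":
    session = [
        (datetime(2012, 10, 26, 9, 0, 0), "me@example.org", "hi"),
        (datetime(2012, 10, 26, 9, 0, 5), "peer@example.org", "hello"),
    ]
    print(session_to_mail(session, "me@example.org", "peer@example.org"))
```

Delivery is deliberately not shown. One option would be handing the message to the local mail server with `smtplib.SMTP(...).send_message(msg)`, but that is only one possible way to do it.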
Oct 19 2012

I set up a mail server and an IM server, again … just for whatever reason. It’s fun though.

I’m still with OpenLDAP, Postfix, MySQL, and ejabberd, but Dovecot replaces Courier for IMAP, amavisd integrated with SpamAssassin and ClamAV replaces the previous SpamAssassin-only system, and Roundcube replaces SquirrelMail. Things are more or less easier to set up.

There are two things left. One is that I need to direct spam to a spam folder. This was done by maildrop, but since I’ve moved away from Courier, procmail may be a more reasonable choice, though I still need to evaluate it. The other is that I want to send all incoming and outgoing mail to some other Gmail accounts so that I can keep a copy of everything, but I haven’t decided on the better approach for this. Also, if it is possible to back up XMPP messages somewhere, that would be great.

I was thinking of building a search feature for this mail system. I haven’t got an exact design yet, but I have some features in mind: close to real time (tens of seconds of latency), attachment friendly (dig into attachments to find context), and scriptable, i.e. the core engine can be (or has to be) C/C++, but lots of the external stuff should be doable in PHP, Perl, etc.

Let’s see.

May 24 2012

OK, if you don’t know what I’m saying, don’t bother to ask.

I’m using MMSEG as the segmentation algorithm to extract Chinese terms from web pages, and then doing some follow-up analysis. MMSEG is pretty good: easy to understand, fast, data driven, small, … alright, lots of advantages. However, it does not handle mixed Chinese-English words properly, as it just takes out each English (Latin) word and returns it as a token of its own, so you can never extract words like U盘 or T恤 (sorry for the Chinese here; the first word is USB disk, and the second is T-shirt).

A colleague suggested treating Latin words as Chinese characters. I found this solution really simple, neat, and easy to implement. Within ~30 minutes I finished the code change and unit tests, and it works just fine :). In the coming days I need to handle words like “C++”, as they are not covered by the current implementation.
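
To make the idea concrete, here is a minimal Python sketch. It is not my actual code change and not full MMSEG: real MMSEG compares three-word chunks and applies several disambiguation rules, while this sketch uses plain forward maximum matching, which is enough to show the trick. The dictionary, `MAX_WORD_UNITS`, and the function names are all made up for illustration.

```python
import re

# Hypothetical toy dictionary; a real segmentation dictionary is far larger.
DICTIONARY = {"U盘", "T恤", "一个"}

# Longest dictionary word we try to match, counted in units (assumption).
MAX_WORD_UNITS = 4

# A run of ASCII letters/digits counts as ONE unit, just like a single
# Chinese character; every other non-space character is its own unit.
# Whitespace is dropped.
UNIT_RE = re.compile(r"[A-Za-z0-9]+|\S")


def to_units(text):
    """Split text into matching units (Latin runs or single characters)."""
    return UNIT_RE.findall(text)


def segment(text, dictionary=DICTIONARY):
    """Forward maximum matching over units instead of raw characters."""
    units = to_units(text)
    tokens = []
    i = 0
    while i < len(units):
        # Try the longest candidate first and shrink until a dictionary
        # word matches; a single unit is always accepted as a fallback.
        for n in range(min(MAX_WORD_UNITS, len(units) - i), 0, -1):
            candidate = "".join(units[i:i + n])
            if n == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += n
                break
    return tokens


if __name__ == "__main__":
    print(segment("我买了一个U盘和T恤"))
    # ['我', '买', '了', '一个', 'U盘', '和', 'T恤']
    print(segment("C++"))
    # ['C', '+', '+']
```

Because a Latin run is just another unit, U盘 can sit in the dictionary like any two-character Chinese word. In this sketch “C++” still falls apart, since “+” is not part of a Latin run.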

People are saying Yahoo is not handling C++/C# properly, while Bing is not dealing with U盘 right. Actually this sounds weird to me, as supposedly they both use the same web search system …