One of my friends asked me whether the following service is doable:
- Every hour, multiple machines (clients) will send ~1M action records to the service
- Each action contains: user id, action, action result, start time, end time, and a couple of user profile keys
- The service should be able to deliver reports for any given time frame (specified in minutes) within 15 min after the period finishes. For example, the report for 3pm~3:30pm should be available by 3:45pm
- Reports include: how many users did a specific action with a specific result (during that time frame), how many actions a specific user did, and the top actions taken by other users who did the same action as the user (hard to understand, but think about Amazon's “Customers who bought this item also bought…”)
- Most important: all of these requirements must be met on no more than 4 machines, including redundancy, which means the actual work has to fit on 2 or 3 machines
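To make the three reports concrete, here is a minimal single-process sketch in Python. It is not a design, just one reading of the requirements. The names (`Action`, `MinuteBuckets`, `users_with`, `actions_by`, `also_did`) are made up for illustration, and it assumes each record is filed under the minute in which it ended and that everything fits in memory.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    user_id: str
    action: str
    result: str
    start: int  # epoch seconds
    end: int    # epoch seconds


class MinuteBuckets:
    """Hypothetical in-memory store: one list of records per minute."""

    def __init__(self):
        self.buckets = defaultdict(list)

    def add(self, rec):
        # Assumption: a record belongs to the minute in which it ended.
        self.buckets[rec.end // 60].append(rec)

    def _records(self, start_min, end_min):
        # Half-open range [start_min, end_min) of minute numbers.
        for minute in range(start_min, end_min):
            yield from self.buckets.get(minute, ())

    def users_with(self, start_min, end_min, action, result):
        """Report 1: distinct users who did `action` with `result`."""
        return len({r.user_id for r in self._records(start_min, end_min)
                    if r.action == action and r.result == result})

    def actions_by(self, start_min, end_min, user_id):
        """Report 2: number of actions a specific user did."""
        return sum(1 for r in self._records(start_min, end_min)
                   if r.user_id == user_id)

    def also_did(self, start_min, end_min, user_id, action, top_n=3):
        """Report 3: top other actions by other users who did `action`."""
        recs = list(self._records(start_min, end_min))
        peers = {r.user_id for r in recs
                 if r.action == action and r.user_id != user_id}
        counts = Counter(r.action for r in recs
                         if r.user_id in peers and r.action != action)
        return counts.most_common(top_n)
```

For the 3pm~3:30pm example, `start_min` and `end_min` would be the minute numbers of 3:00pm and 3:30pm. Every query here rescans the raw records, which is fine for a sketch but is exactly the part a real design would replace with per-minute pre-aggregated counters.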
I have to say, this is a pretty common requirement for online services (shopping, search, gaming, … could be anything), and if I can make it work and make the solution scale linearly, it will be pretty interesting (let’s say 40 machines supporting 10M actions per hour, which is a lot already).
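A quick back-of-envelope check of those numbers, as a small Python sketch. It assumes the ~1M records per hour is the total across all clients (which matches the 40 machines / 10M figure) and that records arrive evenly over the hour, which real traffic will not do.

```python
SECONDS_PER_HOUR = 3600


def per_second(actions_per_hour):
    """Average arrival rate, assuming an even spread over the hour."""
    return actions_per_hour / SECONDS_PER_HOUR


# Baseline from the requirements: ~1M actions per hour on 4 machines.
total_rate = per_second(1_000_000)   # about 278 actions per second
per_machine = total_rate / 4         # about 69 actions per second

# With redundancy, only 2 machines may be doing the real work.
per_working_machine = total_rate / 2  # about 139 actions per second

# Linear scaling target: 10M actions per hour on 40 machines.
scaled_per_machine = per_second(10_000_000) / 40  # same as per_machine

print(round(total_rate), round(per_machine), round(per_working_machine))
```

So the raw write rate per machine is modest. The harder part is likely the reporting side: arbitrary time frames and the "also did" query within the 15-minute deadline.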
I will do some research (the NoSQL stuff could well fit this one) and post my thinking/design here.