Adventures of a Protoss in Seattle: Guide to databases in a start-up; execute the vision

I've been doing start-ups for about 3 years now, and I love it because being in a start-up is probably the most hard-core way to live. It is super-duper fun. For the sake of my own humility, I admit that I've made many errors, and I (am reluctantly forced to) embrace them as super-awesome-learning-experiences. If I were to start over (which is always a possibility), then this is how I would do it.

The Task

When you are starting up, you have a blank piece of paper and you need to turn the emptiness into a product. If you try to build it to be perfect from the start, then you will probably fail hard-core and waste a lot of time "refactoring". You need to realize that the need for change is the only technological invariant in a start-up. Changes come from many sources, and you need to hit the market as soon as your product is viable.

The Task as "the guy who knows databases"

The simple version of this task is "to not fuck things up", but it is impossible not to fuck up because that is the nature of the game. So, the task is to enable change and encourage change; that is a tall order, and it takes a bit of magic to pull off. Don't worry, this guide is here to help. The first step is to forget everything you may have learned in school about databases. In fact, forget everything you've learned during previous jobs. And, by "forget", realise that I mean that there was a context to what you learned that was relevant at the time. In a start-up, you have a blank context and you need to find what that means for your product.

Step 2. NoSQL to the rescue

The typical scenario is this: you have a bunch of users that have stuff. All of that stuff, we are going to put into a giant object per user. Where do we put it? We put it in a file (the file system is the oldest NoSQL solution on the market) with the user's email as the filename. Use one giant object for storing whatever it is that your users want to store; this object is your per-user database. (I'm glossing over collaborative features, but the key here is to center on documents as giant objects.)

Rule: Use huge objects for everything, sharded by the most obvious key (i.e. user)
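
To make the rule concrete, here is a minimal sketch of the "one file per user" store in Python. Everything named here is my own assumption for illustration: the `user_data` directory, JSON as the on-disk format, and the function names. The email is percent-encoded so it is always a safe filename, and the save writes to a temp file and renames it so a crash never leaves a half-written object.

```python
import json
import os
import tempfile
from urllib.parse import quote

DATA_DIR = "user_data"  # hypothetical directory holding one file per user


def path_for(email, data_dir=DATA_DIR):
    # Percent-encode the email so characters like "/" cannot escape the directory.
    return os.path.join(data_dir, quote(email, safe="") + ".json")


def load_user(email, data_dir=DATA_DIR):
    """Return the user's giant object, or an empty one for a new user."""
    try:
        with open(path_for(email, data_dir), encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def save_user(email, obj, data_dir=DATA_DIR):
    """Write the whole object to a temp file, then rename it into place."""
    os.makedirs(data_dir, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=data_dir)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(obj, f)
    os.replace(tmp, path_for(email, data_dir))
```

Usage is just `obj = load_user(email)`, mutate `obj`, then `save_user(email, obj)`. Swapping the two functions for calls to some other key-value store later should not change the callers.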

Benefits: It Scales! Simply use S3 as a drop-in replacement for the file system, and you are done.

Big Con: It is all disconnected! Oh shit, it doesn't do anything useful or provide any insight into what is going on.

What about concurrency? Just man up and use file locks until you need fine-grained locking; chances are, you will not need it until you get some sales guys messing around with your users' data.
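
A rough sketch of what "just use file locks" could look like, assuming a Unix-like system (Python's `fcntl` module is not available on Windows) and a JSON file edited in place. The helper names are hypothetical. Note that `flock` is advisory: it only protects you from other code that also takes the lock.

```python
import fcntl
import json
import os
from contextlib import contextmanager


@contextmanager
def locked_user_file(path):
    """Hold an exclusive advisory lock on `path` while the caller edits it."""
    # "a+" creates the file if it is missing without truncating an existing one.
    with open(path, "a+", encoding="utf-8") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until no one else holds the lock
        try:
            yield f
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


def update_user(path, change):
    """Read-modify-write the giant object while holding the lock."""
    with locked_user_file(path) as f:
        f.seek(0)
        text = f.read()
        obj = json.loads(text) if text else {}
        change(obj)  # the caller mutates the object in place
        f.seek(0)
        f.truncate()  # drop the old contents before writing the new ones
        json.dump(obj, f)
        f.flush()
    return obj
```

The whole object is locked for the whole update, which is exactly the coarse-grained behaviour described above; that is fine while one user is the only writer to their own file.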

Step 3. Enable Reporting

When you need it, you replicate from your file to the database/search/other thing. That is, when you perform an operation on your file that changes something, you then need to update wherever you replicated to (I know, I know, amazing insight). I recommend focusing on table differences since they work very well with database systems. You can quickly enable table differences by writing a function that generates an in-memory version of your table from a single user's object. Any operation that changes this table needs to store a copy of the table (per user) before, then change whatever, then compute a copy of the table after. The difference focuses on items that (a) are new and need to be inserted and have primary keys allocated, (b) are deleted, or (c) are changed. It sucks, but it will work for enough time. Another option is to delete everything in the database for that user's table and regenerate it (this works well when you need to do it asynchronously and have to rely on a potentially out-of-order queue).
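
Here is a small sketch of the table difference in Python. I am assuming each in-memory table is a dict that maps a row key (stable within the user's object) to a row dict; that representation and the names are mine, and allocating real primary keys for the inserted rows is left to whatever database you replicate to.

```python
def diff_tables(before, after):
    """Compare two in-memory tables, each a dict of row key -> row dict.

    Returns (inserted, deleted, changed):
      inserted: rows that exist only in `after`
      deleted:  keys that exist only in `before`
      changed:  rows present in both whose contents differ
    """
    inserted = {k: row for k, row in after.items() if k not in before}
    deleted = [k for k in before if k not in after]
    changed = {k: row for k, row in after.items()
               if k in before and before[k] != row}
    return inserted, deleted, changed


# Example: one row removed, one row edited, one row added.
before = {"t1": {"title": "buy milk", "done": False},
          "t2": {"title": "ship it", "done": False}}
after = {"t2": {"title": "ship it", "done": True},
         "t3": {"title": "sleep", "done": False}}
inserted, deleted, changed = diff_tables(before, after)
# inserted holds "t3", deleted is ["t1"], changed holds the new "t2" row
```

Each of the three results maps onto one kind of statement (insert, delete, update) against the reporting table.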

Benefits: Huge Scale. As a bonus, you are ready for both database sharding and map reduce techniques should you determine which will work for you (or maybe a mixture of both).

Con: It is slow to replicate. Yes, replication sucks for many reasons. However, you can get a quick boost of adrenaline by injecting a queue and doing the replication asynchronously.
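
As a sketch of "inject a queue", here is an in-process version using Python's standard `queue` and `threading` modules. A real deployment would likely use an external queue; the `report_db` dict, `build_rows`, and the sample user are stand-ins I made up. The worker uses the delete-and-regenerate option, so duplicate or out-of-order messages are harmless.

```python
import queue
import threading

replication_queue = queue.Queue()
report_db = {}  # stand-in for the real reporting database (assumption)


def build_rows(user_obj):
    # Hypothetical projection of the giant object into report rows.
    return [{"item": name} for name in sorted(user_obj.get("items", []))]


def replicate_worker(load_user):
    """Drain the queue; for each user id, reload the object and regenerate its rows."""
    while True:
        user_id = replication_queue.get()
        if user_id is None:  # sentinel: stop the worker
            replication_queue.task_done()
            break
        # Replacing all of the user's rows from the current object means
        # a stale or repeated message cannot leave the report wrong.
        report_db[user_id] = build_rows(load_user(user_id))
        replication_queue.task_done()


# Usage: the write path only enqueues the user id and returns immediately.
users = {"a@b.com": {"items": ["y", "x"]}}  # pretend file store
worker = threading.Thread(target=replicate_worker, args=(users.get,), daemon=True)
worker.start()
replication_queue.put("a@b.com")
replication_queue.put("a@b.com")  # a duplicate message does no harm
replication_queue.join()          # wait until the report has caught up
replication_queue.put(None)
worker.join()
```

The trade-off is that the report lags behind the file for a moment, which is usually acceptable for reporting.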

What about the CAP theorem? The CAP theorem is a giant kick in the balls, but you don't need to worry about it until the product has matured to the point where CAP actually matters. Just make sure you back up hourly... ;)

Step 4. Build Secret Thing

This is where you collect data from your users and turn it into happiness. This is what makes your product awesome. This is what makes it work. This is what you want to build without worrying about stupid scalability or performance crap. This is the thing you think about while in the shower. You are on your own; good luck.

Step 5. Prepare for changes

At this point, you have users and documents that have some form of reporting. You have something cool that makes the business work. That is 10% of the challenge for a start-up. After building it, you no longer have a blank sheet of paper. You have something, and you need to improve it. Chances are, the technology and code don't need to be improved for their own sake. Honestly ask yourself these questions once the prototype is built: Does it have a market? What can make the product more marketable? Your few users are complaining; how can the complaints be resolved? Your visitors are bouncing; why are they bouncing? Does the UX need to be improved? Answers to these questions are where changes come from, and they come from everywhere. You need to be able to quickly adapt and respond in a timely and reasonable manner.

Aside: Is this a good strategy? For technology awesomeness? No. For business? Yes! The above strategy has commercial vendor support. Look at S3 and SimpleDB; they basically do what I am promoting at some level, with EC2 for building your secret sauce. Do I recommend jumping in Amazon's boat? Not yet; you may not need it. The optimal goal is to make the product as fast as you can and try to determine the market-fit of the product. The faster you make it, the sooner the entire company can be involved in product iteration. If you can't market or sell the product, then it doesn't matter if you need to scale it up or have Google's performance. However, you could probably open source it and build a consulting group around it if you can neither market nor sell it.

Step 6. Support users

We live in a golden time where storage is practically free and infinite. Since the core of this strategy is a key-value pair system (file system, S3, etc.), the best thing you can do to support users is record everything. Every time you save your user's data, keep a copy of it with a time stamp, IP, and the user id that invoked the write. Just do it. This enables you to support your users in ways that aid operations (faults/security breaches), developers (finding bugs), and the user (data recovery). Also, make a backup every hour and test recovery every week. ;)
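
A minimal sketch of "record everything" on top of the file system, in Python. The directory layout, field names, and function names are assumptions of mine, and I assume the ids are already safe to use as file names. Each save writes the current object and also keeps a timestamped copy tagged with the IP and the user id that invoked the write.

```python
import json
import os
import time


def save_with_history(data_dir, owner, obj, written_by, ip, now=None):
    """Save the owner's current object and keep a timestamped audit copy."""
    now = time.time() if now is None else now
    history_dir = os.path.join(data_dir, "history", owner)
    os.makedirs(history_dir, exist_ok=True)
    record = {"timestamp": now, "ip": ip, "written_by": written_by, "data": obj}
    # Zero-padded timestamps make the file names sort in write order.
    version_path = os.path.join(history_dir, "%017.6f.json" % now)
    with open(version_path, "w", encoding="utf-8") as f:
        json.dump(record, f)
    with open(os.path.join(data_dir, owner + ".json"), "w", encoding="utf-8") as f:
        json.dump(obj, f)
    return version_path


def load_history(data_dir, owner):
    """Return every recorded version for the owner, oldest first."""
    history_dir = os.path.join(data_dir, "history", owner)
    records = []
    for name in sorted(os.listdir(history_dir)):
        with open(os.path.join(history_dir, name), encoding="utf-8") as f:
            records.append(json.load(f))
    return records
```

With this in place, "who changed what, when, and from where" is a directory listing, and data recovery is copying an old version's `data` back through the normal save path.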

Epilogue: Good Luck

If you are in a start-up, then good luck. I hope you have a good idea that will make my life better, and I don't want you to fuck up on the execution. Give me something that will make me better. Give me something that will make me happy. Give me something that I want. Give me something that my woman will want. And, I? I will give you money for it. How much? $9.99 or maybe if you are lucky, $19.95. Do not fuck up the execution; find the humility to just "make it work" even if it kills you on the inside.

Saturday, August 14, 2010
