Friday, October 15, 2010

NoSQL Design; a primer for future data architects

*This is a rough outline of a book I'm working on; after writing it, I realized how much knowledge debt and background understanding it takes for granted.*

Having done RDBMS/SQL-based design for the past 10 years, I've been skeptical about how NoSQL works and how it impacts the way we store and query data. For the past two years, I've engrossed myself in using MySQL to emulate and figure out NoSQL design patterns.

Rule 1: Denormalize like a Crazy Person

Normalization is a side effect of using a relational database. If you buy into a relational database, then you buy into its normalization strategies. In relational land, normalization lets you use SQL as a rich query language. For NoSQL, you must denormalize and think in fewer "tables". Think in terms of documents: how can one document store a lot of related data?

A user has multiple phone numbers, and you can represent this with a table consisting of tuples (user_id, label, number). Or, you can augment the user document with a field that stores an array of records. That is,

class Phone { string label; string number; }

class User { guid id; Phone [] phone; }

Serialize your user object into a JSON string and ship it off to a NoSQL solution.
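As a rough sketch of that round trip in Python (the dataclasses mirror the Phone and User classes above; the helper names `to_document` and `from_document` are made up for illustration, and no particular NoSQL client is assumed):

```python
import json
import uuid
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Phone:
    label: str
    number: str


@dataclass
class User:
    id: str
    # The phone numbers live inside the user document, not in their own table.
    phone: List[Phone] = field(default_factory=list)


def to_document(user: User) -> str:
    """Serialize the user and its embedded phone records into one JSON string."""
    return json.dumps(asdict(user), sort_keys=True)


def from_document(raw: str) -> User:
    """Rebuild the User, including its embedded phones, from the JSON string."""
    data = json.loads(raw)
    return User(id=data["id"], phone=[Phone(**p) for p in data["phone"]])


user = User(
    id=str(uuid.uuid4()),
    phone=[Phone("home", "555-0100"), Phone("work", "555-0101")],
)
doc = to_document(user)  # this string is what you would ship to the store
```

One read of the document now brings back the user and every phone number, with no join.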

Rule 2: Embrace the "Sea of Shit" (i.e. schema-less design)

Just as dynamic typing made my ass itch years ago, so does schema-less design today. However, the way through is to think in terms of document versioning and namespace partitioning. That is, give every document a field called "class_type" and a field called "class_type_version", and then use them in the obvious ways.

The consumers (developers and yourself) of your data should understand that the schema has multiple versions and have a way to gracefully degrade or be able to initiate a remote upgrade. Alternatively, there could be an upgrade script that does this, but I find that doing it lazily works well if you find the discipline to control and work against versions.
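Here is one way the lazy approach could look, as a sketch only. The "class_type" and "class_type_version" fields come from the rule above; the "user" type, its two versions, and the v1-to-v2 migration are invented for the example.

```python
import json

# Hypothetical: the newest version of each class_type this code understands.
CURRENT_VERSIONS = {"user": 2}


def upgrade_user_v1_to_v2(doc: dict) -> dict:
    """Invented migration: v1 kept one 'phone' string, v2 keeps a list of records."""
    upgraded = dict(doc)
    number = upgraded.pop("phone", None)
    upgraded["phones"] = [{"label": "default", "number": number}] if number else []
    upgraded["class_type_version"] = 2
    return upgraded


# One upgrade step per (class_type, from_version) pair.
UPGRADES = {("user", 1): upgrade_user_v1_to_v2}


def load(raw: str) -> dict:
    """Parse a stored document and upgrade it one version at a time on read."""
    doc = json.loads(raw)
    target = CURRENT_VERSIONS.get(doc["class_type"], doc["class_type_version"])
    while doc["class_type_version"] < target:
        step = UPGRADES.get((doc["class_type"], doc["class_type_version"]))
        if step is None:
            break  # no known upgrade path: hand it back and let the caller degrade
        doc = step(doc)
    return doc
```

A document written by newer code (a version above the one this reader knows) comes back untouched, which is where the graceful degradation has to happen. Writing the upgraded document back to the store after `load` is what makes the upgrade stick.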

Rule 3: Dominate Complexity with a Dominating Subset Index

Complexity sucks, and the ultimate goal is to get the cost of rendering a page down to O(1) or O(f), where f is linear in the size of the page's output (and at most sublinear in the size of the entire data set). While half of your view code will be dedicated to viewing single documents, the other half deals with aggregated/indexed sets of documents. Anything else is a special case, and handled by replication.

Conquering this requires thinking in Dominating Subsets: you place an index on your documents (and store the index as a document), and you have some efficient way of bringing the index (or a subset of it) to a developer. This is where you do the dreaded join in application logic, but it will be ok as long as the complexity of the join is related to the output of the page. Relax, it will be ok.
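To make that concrete, here is a small sketch where a plain dict stands in for the key-value store. The key naming scheme, the "posts by author" index, and both function names are assumptions made up for the example, not any real store's API.

```python
# A plain dict stands in for the document store (hypothetical keys and fields).
store = {
    "post:1": {"title": "Hello", "author": "ann"},
    "post:2": {"title": "Indexes", "author": "bob"},
    "post:3": {"title": "Caching", "author": "ann"},
    # The index is itself a document: an ordered list of post keys, newest first.
    "index:posts_by_author:ann": {"keys": ["post:3", "post:1"]},
    "index:posts_by_author:bob": {"keys": ["post:2"]},
}


def add_post(key, post):
    """Write the document and keep the index document in step with it."""
    store[key] = post
    index_key = f"index:posts_by_author:{post['author']}"
    index = store.setdefault(index_key, {"keys": []})
    index["keys"].insert(0, key)  # newest first


def page_of_posts(author, offset, limit):
    """The application-side join: one index read, then one read per row shown."""
    index = store.get(f"index:posts_by_author:{author}", {"keys": []})
    keys = index["keys"][offset:offset + limit]
    return [store[k] for k in keys]
```

The number of document reads in `page_of_posts` is bounded by `limit`, so the cost of the join follows the size of the page rather than the size of the data set.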

Rule 4: Replicate like a Pirate

Disk space is cheap, and memory is getting cheaper. Unless you are Google, a single server can solve just about any problem you have if you can just get the data to it. With node.js or node.ocaml, it is feasible to build the services that drive the business in a customized fashion. Once you get past the single server, the service-based design becomes its own challenge. However, it is now isolated from the rest of the ecosystem and can be measured and monitored independently.

Rule 5: Cache, Cache, and then Cache some more. Invalidate!

Fundamentally, you could cache everything forever with no time stamps if you just knew how to recompute the caches based on what you update. This is a difficult problem, but in principle it can be worked out with a dependency graph of how your reads depend on your writes. It sounds simple, but it isn't at the application level.
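A toy version of the idea, assuming every cached read can name the keys it depends on up front (the hard part in a real application is exactly that bookkeeping). The `DependencyCache` class and all keys below are invented for illustration.

```python
from collections import defaultdict


class DependencyCache:
    """Caches computed reads forever and drops them when a key they read is written."""

    def __init__(self):
        self._values = {}
        # Edges of the dependency graph: written key -> cache keys that read it.
        self._readers = defaultdict(set)

    def get(self, cache_key, depends_on, compute):
        """Return the cached value, computing and registering it on a miss."""
        if cache_key not in self._values:
            self._values[cache_key] = compute()
            for written_key in depends_on:
                self._readers[written_key].add(cache_key)
        return self._values[cache_key]

    def invalidate(self, written_key):
        """Call after every write: evict each cached read that depended on the key."""
        for cache_key in self._readers.pop(written_key, set()):
            self._values.pop(cache_key, None)


db = {"user:1": {"name": "Ann"}}
cache = DependencyCache()


def render_home():
    return f"Hello, {db['user:1']['name']}"


page = cache.get("page:home", ["user:1"], render_home)  # computed once, no expiry
db["user:1"]["name"] = "Anna"
cache.invalidate("user:1")  # the write evicts exactly the pages that read user:1
```

Nothing here expires on a timer; an entry lives until one of the keys it read is written.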
