Thursday, December 16, 2010

Understanding a sea of JSON with Map Reduce

CouchDB stores a lot of data in a sea of JSON, and it isn't exactly easy to get a good grasp on what there is.

For WIN, I force each object to have a namespace field called 'ns'; this lets me partition the data, and lets developers do the same. Ideally, this helps keep things separate.

A fundamental problem is that I want an idea of what is in the data set, and I want to be able (and to enable developers) to write appropriate documentation so everyone stays on the same page. I would also like the data to adhere to some kind of structural quality. It would be nice to be able to look for oddities that could become future support issues (it would also be nice if everyone used the same language and kept things consistent; I would rather nip inconsistencies in the bud early rather than late).

So, I flatten the structural qualities of each object and count them using CouchDB's incremental MapReduce.
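A minimal sketch of such a view, assuming a map function that flattens each document into `path:type` strings and the built-in `_count` reduce (the helper name `flatten` and the emitted key shape are my own, not necessarily the code from the post):

```javascript
// Flatten a document into "dotted.path:type" strings, one per leaf field.
function flatten(obj, prefix, out) {
  for (var key in obj) {
    var path = prefix ? prefix + '.' + key : key;
    var val = obj[key];
    if (val !== null && typeof val === 'object' && !Array.isArray(val)) {
      flatten(val, path, out); // recurse into nested objects
    } else {
      var type = Array.isArray(val) ? 'array' : typeof val;
      out.push(path + ':' + type);
    }
  }
  return out;
}

// CouchDB map function: emit [ns, flattened path] so counts group
// per namespace; pair with the built-in _count (or _sum) reduce.
function map(doc) {
  var paths = flatten(doc, '', []);
  for (var i = 0; i < paths.length; i++) {
    emit([doc.ns, paths[i]], 1);
  }
}
```

Querying the view with `group=true` then yields one row per (ns, path:type) pair along with how many documents share that shape, which is exactly the kind of census you can diff day over day.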

This enables me to grep the code base and then use blame to work with the developer to resolve oddities. Or, I can turn a blind eye when the oddity lives in data that doesn't matter that much (e.g. metadata or user-controlled data).

I can monitor this daily for changes to see what is happening in development (where oddities first get introduced).

This mode of thinking enables me to think about unicorns when it comes to the database (oh, and I never allow anyone to delete; everything goes to trash with a trash_goes_out_on field set 60 days in the future, when it will actually be deleted).
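The soft-delete side of this can be sketched as follows; the helper name `moveToTrash` and the `original_ns` field are hypothetical, only `ns` and `trash_goes_out_on` come from the post:

```javascript
// Instead of DELETE-ing, restamp the document into the 'trash'
// namespace with a trash_goes_out_on date 60 days out.
function moveToTrash(doc) {
  var later = new Date();
  later.setDate(later.getDate() + 60);
  doc.original_ns = doc.ns;     // remember where it came from (my assumption)
  doc.ns = 'trash';
  doc.trash_goes_out_on = later.toISOString().slice(0, 10); // "YYYY-MM-DD"
  return doc; // save this back to CouchDB rather than deleting
}
```

Because the document survives, it still shows up in the structure census above (under the 'trash' namespace), and nothing is lost until the expiry job runs.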


  1. Are you using a view to filter your results by trash_goes_out_on?

    Because that isn't possible with just views in CouchDB.

  2. @JK

    You basically have to index all your documents that have .ns == 'trash' and emit based on the date they should be deleted. Then you set up a cron job that starts deleting things due on the current day. It also helps to capture a single command stamp so you can recover data in bulk.