Saturday, January 15, 2011

vacuum.js alpha (slightly inspired by nature/biology based computation)

I just cut a very rough version of vacuum.js to test it out with a couple of friends.

One thing that I miss from RDBMS land is the ability to fix bad data using command line SQL. For instance, with a statement like this:

update places set state=some_complicated_function(state) where state='blah';

I can fix bad data that somehow got into the system. A consequence of moving fast is that you get unexpected data (which may have more value than you think) that can break things. Either you plan it all out, or you let some criminals into the database to see how the business case works out. Sometimes a lack of planning enables sales to hack things out on their own. ;)

Anyway, back to vacuum and CouchDB. There isn't an easy way to update the entire database or a subset of it; the easiest option you have is to build a view that brings out the bad data and then run a cron job that cleans things up every now and again.
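To make the "view that brings out bad data" idea concrete, here is a minimal sketch of what such a map function could look like. It assumes documents carry a `state` field (borrowed from the SQL example above) and that a good value is exactly two uppercase letters. The name `badStateMap` is made up for illustration, and the `emit` stub only exists so the sketch runs under Node; inside CouchDB, `emit(key, value)` is provided for you.

```javascript
// Rows collected by the stand-in emit(), shaped like CouchDB view rows.
const rows = [];
let currentId = null;

// Stand-in for CouchDB's built-in emit(key, value), so this runs under Node.
function emit(key, value) {
  rows.push({ id: currentId, key: key, value: value });
}

// Hypothetical map function: emit every document whose state looks wrong.
// Assumption: documents without a `state` field are some other document
// type and are skipped rather than flagged.
function badStateMap(doc) {
  if (doc.state === undefined) return;
  if (typeof doc.state !== "string" || !/^[A-Z]{2}$/.test(doc.state)) {
    emit(doc.state, null);
  }
}

// Tiny in-memory stand-in for the database (made-up sample documents).
const docs = [
  { _id: "1", state: "CA" },
  { _id: "2", state: "calif" },
  { _id: "3", state: "wa" },
  { _id: "4", type: "note" }
];

for (const doc of docs) {
  currentId = doc._id;
  badStateMap(doc);
}

console.log(rows);
```

A cron job would then query this view and rewrite whatever comes back, which is the part vacuum.js is meant to take over.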

That's vacuum.js alpha.

My example data set contains a bunch of US state abbreviations, except some are very bad and malformed. In an RDBMS, this is fairly easy to fix. With vacuum.js, it's even easier to fix.

The current example provides two functions that run in the CouchDB context: one marks documents as bad if they don't have a two-letter state code (I could also verify that the code actually exists); the other tries its very best to fix them.
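As a rough illustration of that detect/fix pair, here is a sketch in plain JavaScript. Every name in it (`isBad`, `fix`, `vacuumPass`, `STATE_NAMES`) is hypothetical and is not the actual vacuum.js API; the lookup table is deliberately partial, and the pass runs over an in-memory array rather than a real CouchDB database.

```javascript
// Partial lookup table, for illustration only (not all 50 states).
const STATE_NAMES = new Map([
  ["california", "CA"],
  ["washington", "WA"],
  ["texas", "TX"]
]);

// Detector: a document is bad if its state is not two uppercase letters.
function isBad(doc) {
  return typeof doc.state !== "string" || !/^[A-Z]{2}$/.test(doc.state);
}

// Fixer: returns a repaired copy of the document, or null if it gives up.
function fix(doc) {
  const raw = String(doc.state == null ? "" : doc.state).trim();
  const cleaned = raw.replace(/\./g, ""); // "W.A." becomes "WA"
  let state = null;
  if (/^[A-Za-z]{2}$/.test(cleaned)) {
    state = cleaned.toUpperCase();
  } else if (STATE_NAMES.has(cleaned.toLowerCase())) {
    state = STATE_NAMES.get(cleaned.toLowerCase());
  }
  return state ? Object.assign({}, doc, { state: state }) : null;
}

// One pass: repair what can be repaired, leave everything else untouched.
function vacuumPass(docs) {
  return docs.map(function (doc) {
    if (!isBad(doc)) return doc;
    return fix(doc) || doc;
  });
}

// Made-up sample documents.
const cleanedDocs = vacuumPass([
  { _id: "a", state: " ca " },
  { _id: "b", state: "Texas" },
  { _id: "c", state: "W.A." },
  { _id: "d", state: "??" },
  { _id: "e", state: "NY" }
]);

console.log(cleanedDocs);
```

Keeping the detector and the fixer separate means a document the fixer cannot repair (like `"??"` here) simply stays flagged as bad for a human to look at later.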

The current version is a quick hack job, and I'm sitting down now to flesh out a better version. As of now, it's good enough to test and apply singular fixes for a couple of my projects; however, since it polls, it is going to waste a lot of time on bigger systems. Check out the README to see where my current thoughts are.

As an interesting aside, I think NoSQL, and thinking in terms of NoSQL design, enables nature/biological thinking with MapReduce. I think it is very interesting to think of code working in an ecosystem where you can program receptors and bind actions to them. I may rename vacuum.js to something more in this vein when I'm finished with my plans. Using the same type of system, I could split/merge data with functional programming paradigms. I think this is the key to designing with MapReduce and CouchDB.

This is definitely going into the book I'm writing about CouchDB design.
