Adventures of a Protoss in Seattle: Database Development Mistakes as NoSQL propaganda

Context

http://stackoverflow.com/questions/621884/database-development-mistakes-made-by-application-developers

Summary

Not using appropriate indexes
Not enforcing referential integrity
Using natural rather than surrogate (technical) primary keys
Writing queries that require DISTINCT to work
Favouring aggregation over joins
Not simplifying complex queries through views
Not sanitizing input
Not using prepared statements
Not normalizing enough
Normalizing too much
Using exclusive arcs
Not doing performance analysis on queries at all
Over-reliance on UNION ALL and particularly UNION constructs
Using OR conditions in queries
Not designing their data model to lend itself to high-performing solutions
Selfish database design and usage.
Abusing denormalised data
Scared of writing SQL
Dogmatic 'No Stored Procedures' policies.
Not understanding database design
Not using version control on the database schema
Working directly against a live database
Not reading up and understanding more advanced database concepts (indexes, clustered indexes, constraints, materialized views, etc)
Failing to test for scalability ... test data of only 3 or 4 rows will never give you the real picture of real live performance
They only test on toy databases.
Not using indexes.
Not communicating with experienced DBAs.
Poor Performance Caused by Correlated Subqueries
Forgetting to set up relationships between the tables.
Not using parameterized queries.
Favoring "Elegant" code over highly performing code.
Not doing the correct level of normalization.
You want to make sure that data is not duplicated
Using Excel for storing (huge amounts of) data.
Unnecessarily using a function on a value in a where clause with the result of that index not being used.
Not adding check constraints to ensure the validity of the data.
Adding unnormalized columns to tables out of pure laziness or time pressure.
not so much about the database per se but indeed annoying.
Not taking advantage of CLUSTERED INDEXES
Not using a SERIAL (autonumber) datatype as a PRIMARY KEY
Not UPDATING STATISTICS on a table when many records have been INSERTED or DELETED.

My Thoughts

All of these are consequences of using a one-size fits-all solution for storing your data. Fact is, application developers shouldn't worry about how they use data. They should be able to get their job done without worrying about the long-beard in the back room. I've been in this role, and I can sympathize with it.

Then, I realized something has to change. I took away SQL and built a very simple RESTful layer to the data layer, and then I watched how application developers solved their problems. I was amazed at their cleverness. Instead of saying "oh, these silly application developers are so dumb and don't know shit about databases", I said "I wonder how clever they could be if I just gave them memcached and simple get/put/by_index".

They taught me a thing or to about how awesome memcache can be (especially with cron-jobs).

Ideally, if you are building the data layer, then all you need to do to enable application developers is get the right complexity class out of the data. If you have ten billion things, then you need to provide the functions that get to a thousand things relevant to what the application developer needs to do. For bigger tasks, computations are best represented with MapReduce, and I feel that MapReduce is way easier to learn for fresh application developers. CouchDB's incremental MapReduce is by far the easiest to learn.

That being said, performance is always going to be an issue. If you enable developers this way, then you need to provide a realistic environment.

Have a development server with more data than production and with a slower CPU (if you can't do this, then you the ability to connect to production in a read-only mode).
Force them to profile their code (ab works very well for most situations)
Work with business people to define how consistency should work
Train them to do cache invalidation

In my mind, core to the NoSQL movement is the ability to empower more people with persistence rather than knowing edge cases of how a database works. The sooner you empower people, the sooner the business can iterate and the more time you can dedicate to sitting on a beach thinking about unicorns.

Related entry: Big Data enables Agile Data.

Adventures of a Protoss in Seattle

Tuesday, December 14, 2010

Database Development Mistakes as NoSQL propaganda

No comments:

Post a Comment