Tuesday, December 14, 2010

Database Development Mistakes as NoSQL propaganda

Context

http://stackoverflow.com/questions/621884/database-development-mistakes-made-by-application-developers

Summary
  • Not using appropriate indexes
  • Not enforcing referential integrity
  • Using natural rather than surrogate (technical) primary keys
  • Writing queries that require DISTINCT to work
  • Favouring aggregation over joins
  • Not simplifying complex queries through views
  • Not sanitizing input
  • Not using prepared statements
  • Not normalizing enough
  • Normalizing too much
  • Using exclusive arcs
  • Not doing performance analysis on queries at all
  • Over-reliance on UNION ALL and particularly UNION constructs
  • Using OR conditions in queries
  • Not designing their data model to lend itself to high-performing solutions
  • Selfish database design and usage.
  • Abusing denormalised data
  • Scared of writing SQL
  • Dogmatic 'No Stored Procedures' policies.
  • Not understanding database design
  • Not using version control on the database schema
  • Working directly against a live database
  • Not reading up and understanding more advanced database concepts (indexes, clustered indexes, constraints, materialized views, etc)
  • Failing to test for scalability ... test data of only 3 or 4 rows will never give you the real picture of real live performance
  • They only test on toy databases.
  • Not using indexes.
  • Not communicating with experienced DBAs.
  • Poor Performance Caused by Correlated Subqueries
  • Forgetting to set up relationships between the tables.
  • Not using parameterized queries.
  • Favoring "Elegant" code over highly performing code.
  • Not doing the correct level of normalization.
  • You want to make sure that data is not duplicated
  • Using Excel for storing (huge amounts of) data.
  • Unnecessarily using a function on a value in a where clause with the result of that index not being used.
  • Not adding check constraints to ensure the validity of the data.
  • Adding unnormalized columns to tables out of pure laziness or time pressure.
  • not so much about the database per se but indeed annoying.
  • Not taking advantage of CLUSTERED INDEXES
  • Not using a SERIAL (autonumber) datatype as a PRIMARY KEY
  • Not UPDATING STATISTICS on a table when many records have been INSERTED or DELETED.
My Thoughts

All of these are consequences of using a one-size fits-all solution for storing your data. Fact is, application developers shouldn't worry about how they use data. They should be able to get their job done without worrying about the long-beard in the back room. I've been in this role, and I can sympathize with it.

Then, I realized something has to change. I took away SQL and built a very simple RESTful layer to the data layer, and then I watched how application developers solved their problems. I was amazed at their cleverness. Instead of saying "oh, these silly application developers are so dumb and don't know shit about databases", I said "I wonder how clever they could be if I just gave them memcached and simple get/put/by_index".

They taught me a thing or to about how awesome memcache can be (especially with cron-jobs).

Ideally, if you are building the data layer, then all you need to do to enable application developers is get the right complexity class out of the data. If you have ten billion things, then you need to provide the functions that get to a thousand things relevant to what the application developer needs to do. For bigger tasks, computations are best represented with MapReduce, and I feel that MapReduce is way easier to learn for fresh application developers. CouchDB's incremental MapReduce is by far the easiest to learn.

That being said, performance is always going to be an issue. If you enable developers this way, then you need to provide a realistic environment.
  • Have a development server with more data than production and with a slower CPU (if you can't do this, then you the ability to connect to production in a read-only mode).
  • Force them to profile their code (ab works very well for most situations)
  • Work with business people to define how consistency should work
  • Train them to do cache invalidation
In my mind, core to the NoSQL movement is the ability to empower more people with persistence rather than knowing edge cases of how a database works. The sooner you empower people, the sooner the business can iterate and the more time you can dedicate to sitting on a beach thinking about unicorns.

Related entry: Big Data enables Agile Data.

No comments:

Post a Comment