Tuesday, February 23, 2010

NoSQL Deja Vu

Around thirteen years ago, I helped build a prototype for a custom CRM system that ran on an object database (ObjectStore). The idea isn't quite as crazy as it sounds. The data was extremely hierarchical with parent companies and subsidiaries and divisions and then people assigned to the individual divisions. It was the kind of data model where nearly every query had several recursive joins and there were concerns about performance. Also, the team was really curious about object databases so it was a pretty cool project.

One thing that I learned during that project is that (at least back then) the object database market was doomed. The problem was that when you said "database," people heard "tables of information." When you said "data" people wanted to bring the database administrator (DBA) into the discussion. An object database, which has no tables and was alien to most DBAs, broke those two key assumptions and created an atmosphere of fear, uncertainty and doubt. The DBA, who built a career on SQL, didn't want to be responsible for something unfamiliar. The ObjectStore sales guy told me that he was only successful when the internal object database champion positioned the product as a "permanent object cache" rather than a database. By hiding the word "data," projects were able to fly under the DBA radar.

Fast forward to the present and it feels like the same conflict is happening over NoSQL databases. All the same dynamics seem to be here. Programmers love the idea of breaking out of old-fashioned tables for their non-tabular data. Programmers also like the idea of data that is as distributed as their applications are. Many DBAs are fearful of the technology. Will this marginalize their skills? Will they be on the hook when the thing blows up?

I don't know if NoSQL databases will suffer the same fate as object databases did back in the 90's but the landscape seems to have shifted since then. The biggest change is that DBAs are less powerful than they used to be. It used to be that if you were working on any application that was even remotely related to data, you had to have at least a slice of the DBA's time allocated to your project. Now, unless the application/business is very data centric (like accounting, ERP, CRM, etc.), there may not even be a DBA in the picture. This trend is a result of two innovations. First, is object relational mapping (ORM) technology where schemas and queries are automatically generated based on the code that the programmer writes. With ORM, you work in an object model and the data model follows. This takes the data model out of the DBA's hands. The second innovation is cheap databases. When databases were expensive, they were centrally managed and tightly controlled. To get access to a database, you needed to involve the database group. Now, with free databases, the database becomes just another component in the application. The database group doesn't get involved.

Now that the database is a decision made by the programmer, I think non-relational databases have a better chance of adoption. Writing non-SQL queries to modify data is less daunting for a programmer who is accustomed to working in different programming languages. Still, the programmer needs good tools to browse and modify data because he doesn't want to write code for everything. Successful NoSQL databases will have administration tools. The JCR has the JCR Explorer. CMIS has a cool Adobe Air-based explorer. Both of these cases are repository standards that sit above a (relational or non-relational) database but they were critical for adoption. CouchDB has an administration client called Futon but most of the other NoSQL databases just support an API. You also want to have the data accessible to reporting and business intelligence tools. I think that a proliferation of administration/inspection/reporting tools will be a good signal that NoSQL is taking off.

Another potential advantage is the trend toward distributed applications which breaks the model of having a centralized database service. Oracle spent so much marketing force building up their database as being the centralized information repository to rule the enterprise. In this world of distributed services talking through open APIs, that monolithic image looks primitive. What is more important is minimal latency, fault tolerance, and the ability to scale to very large data sets. A large centralized (and generalized) resource is at a disadvantage along all three of these dimensions. When you start talking about lots of independent databases, the homogeneity of data persistence becomes less of a concern. It's not like you are going to be integrating these services with SQL. If you did, your integration would be very brittle because these agilely-developed services are in a constant state of evolution. You just need to have strong, stable APIs to push and pull data in the necessary formats.

The geeky programmer in me (that loved working on that CRM project) is rooting for NoSQL databases. The recovering DBA in me cringes at the thought of battling data corruption with inferior, unfamiliar tools. In a perfect world, there will be room for both technologies: relational databases for relational data that needs to be centrally managed as an enterprise asset; NoSQL databases for data that doesn't naturally fit into a relational database schema or has volumes that would strain traditional database technology.