Archive for the ‘architecture’ Category

NoSQL Deja Vu

Tuesday, February 23rd, 2010

Around thirteen years ago, I helped build a prototype for a custom CRM system that ran on an object database (ObjectStore). The idea isn’t quite as crazy as it sounds. The data was extremely hierarchical with parent companies and subsidiaries and divisions and then people assigned to the individual divisions. It was the kind of data model where nearly every query had several recursive joins and there were concerns about performance. Also, the team was really curious about object databases so it was a pretty cool project.

One thing that I learned during that project is that (at least back then) the object database market was doomed. The problem was that when you said “database,” people heard “tables of information.” When you said “data” people wanted to bring the database administrator (DBA) into the discussion. An object database, which has no tables and was alien to most DBAs, broke those two key assumptions and created an atmosphere of fear, uncertainty and doubt. The DBA, who built a career on SQL, didn’t want to be responsible for something unfamiliar. The ObjectStore sales guy told me that he was only successful when the internal object database champion positioned the product as a “permanent object cache” rather than a database. By hiding the word “data,” projects were able to fly under the DBA radar.

Fast forward to the present and it feels like the same conflict is happening over NoSQL databases. All the same dynamics seem to be here. Programmers love the idea of breaking out of old-fashioned tables for their non-tabular data. Programmers also like the idea of data that is as distributed as their applications are. Many DBAs are fearful of the technology. Will this marginalize their skills? Will they be on the hook when the thing blows up?

I don’t know if NoSQL databases will suffer the same fate as object databases did back in the 90’s but the landscape seems to have shifted since then. The biggest change is that DBAs are less powerful than they used to be. It used to be that if you were working on any application that was even remotely related to data, you had to have at least a slice of the DBA’s time allocated to your project. Now, unless the application/business is very data centric (like accounting, ERP, CRM, etc.), there may not even be a DBA in the picture. This trend is a result of two innovations. The first is object-relational mapping (ORM) technology, where schemas and queries are automatically generated based on the code that the programmer writes. With ORM, you work in an object model and the data model follows. This takes the data model out of the DBA’s hands. The second innovation is cheap databases. When databases were expensive, they were centrally managed and tightly controlled. To get access to a database, you needed to involve the database group. Now, with free databases, the database becomes just another component in the application. The database group doesn’t get involved.

Now that the database is a decision made by the programmer, I think non-relational databases have a better chance of adoption. Writing non-SQL queries to modify data is less daunting for a programmer who is accustomed to working in different programming languages. Still, the programmer needs good tools to browse and modify data because he doesn’t want to write code for everything. Successful NoSQL databases will have administration tools. The JCR has the JCR Explorer. CMIS has a cool Adobe Air-based explorer. Both of these cases are repository standards that sit above a (relational or non-relational) database but they were critical for adoption. CouchDB has an administration client called Futon but most of the other NoSQL databases just support an API. You also want to have the data accessible to reporting and business intelligence tools. I think that a proliferation of administration/inspection/reporting tools will be a good signal that NoSQL is taking off.

Another potential advantage is the trend toward distributed applications which breaks the model of having a centralized database service. Oracle spent so much marketing force building up their database as being the centralized information repository to rule the enterprise. In this world of distributed services talking through open APIs, that monolithic image looks primitive. What is more important is minimal latency, fault tolerance, and the ability to scale to very large data sets. A large centralized (and generalized) resource is at a disadvantage along all three of these dimensions. When you start talking about lots of independent databases, the homogeneity of data persistence becomes less of a concern. It’s not like you are going to be integrating these services with SQL. If you did, your integration would be very brittle because these agilely-developed services are in a constant state of evolution. You just need to have strong, stable APIs to push and pull data in the necessary formats.

The geeky programmer in me (that loved working on that CRM project) is rooting for NoSQL databases. The recovering DBA in me cringes at the thought of battling data corruption with inferior, unfamiliar tools. In a perfect world, there will be room for both technologies: relational databases for relational data that needs to be centrally managed as an enterprise asset; NoSQL databases for data that doesn’t naturally fit into a relational database schema or has volumes that would strain traditional database technology.

CMS Architecture: Managing Presentation Templates

Monday, January 25th, 2010

Another geeky post…

In my last post, I described the relative merits of managing configuration in a repository vs. in the file system but excluded presentation templates even though how they are managed is just as interesting. Like configuration, presentation templates can be managed in the file system or in the content repository. Like with configuration, if you manage presentation templates in the repository, you need some way to deploy them from one instance of your site to another without moving the content over as well.

There are plenty of additional reasons why you would want to manage presentation templates on the file system. In particular, presentation templates are code and you want to be able to use proven coding tools and techniques to manage them. Good developers will be familiar with using a source code management system to synchronize their local work areas and branch/tag the source tree. Development tools (IDEs and text editors) are designed to work on files in a local file system. If you manage presentation templates in the repository you have to solve all sorts of problems like branching and merging and building a browser-based IDE or integrating with local IDEs. The latter can be done through WebDAV and I have also seen customers use an Ant builder in Eclipse to push a file every time it has changed. Still, the additional complexity can create frustrating issues when the deployment mechanism breaks.

As much as it complicates the architecture, there is one very good case when you would want to manage presentation templates in the repository: when you have a centralized CMS instance that supports multiple, independently developed sub-sites. For example, let’s say you are a university and each school or department has its own web developer that wants to design and implement his own site design. This developer is competent and trustworthy but you don’t want to give him access to deploy his own code directly to the filesystem of the production server. He could accidentally break another site or, worse, bring down the whole server. You could centralize the testing and deployment of code, but that would just create a bottleneck. You could do something like put the CSS and JS in the repository and have him go all CSS Zen Garden, but sooner or later he will want to edit the HTML in the presentation templates.

In this scenario of distributed, delegated development, presentation templates are like content in two very important respects:

  1. presentation templates need access control rules to determine who can edit what.
  2. presentation templates become user input (and user input should never be trusted).

The second point is really important. Just like you need to think twice when you allow a content contributor to embed potentially malicious javascript into pages, you need to worry that a delegated template developer can deploy potentially dangerous server side code. Once that code is on the filesystem of an environment it can create all sorts of mischief. It doesn’t matter if it was intentional or not: if a programmer codes an infinite loop or compromises security, you have a problem. Using templating languages (like Smarty or Velocity) rather than a full programming language (like PHP or Java in JSP) will mitigate that risk but you still have to worry about the developer uploading a script that can run on your server. With staging and workflow, CMSs are good at managing semi-trusted content like presentation templates from distributed independent developers. There is a clear boundary between the runtime of the site and the underlying environment.

If your CMS uses file-system based presentation templates and you delegate sub-site development to the departments who own them, you should definitely put in place some sort of automated deployment mechanism that keeps FTP and SSH access out of the developers’ hands and reduces the potential for manual error. The following guidelines are worth following:

  • Code should always be deployed out of a source code system (via a branch or a tag). That way you will know what was deployed and you can redeploy the same tested code to different environments.
  • Deployments should be scripted. The scripts can manage the logic of what should be put where.
  • Every development team should have an integration environment where they can test code.
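
To make the second guideline concrete, here is a minimal sketch of a scripted deployment. It assumes the tagged code has already been exported from source control into a local directory; the deploy_map paths and the deploy function are hypothetical names of my own, not part of any particular product.

```python
import shutil
from pathlib import Path

# Hypothetical mapping from paths in the exported tag to paths in the target environment.
DEPLOY_MAP = {
    "templates": "site/templates",
    "static/css": "site/htdocs/css",
}


def deploy(export_dir, target_root, deploy_map=DEPLOY_MAP):
    """Copy each mapped directory from an exported tag into the target environment."""
    deployed = []
    for source, destination in deploy_map.items():
        src = Path(export_dir) / source
        dst = Path(target_root) / destination
        if not src.is_dir():
            # Stop early rather than deploy a partial release.
            raise FileNotFoundError(f"missing in export: {source}")
        shutil.copytree(src, dst, dirs_exist_ok=True)  # requires Python 3.8+
        deployed.append(str(dst))
    return deployed
```

Because the script owns the logic of what goes where, the same tag can be pushed to the integration, QA, and production environments by changing only the target root.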

One of my clients uses a product called AnthillPro for deployments of all web applications and also presentation templates. It has taken a while to standardize and migrate all of the development teams but now I don’t see how you can have a de-centralized development organization without it.

The other dimension to this problem is the coupling between the content model and the presentation templates. When you add an attribute to a content type, you need to update the presentation template to show it (or use it in some other way). The deployment of new presentation templates needs to be timed with content updates. Often content contributors will want to see the new attribute in preview when they are updating their content. Templates also need to fail gracefully when they request an attribute that does not yet exist or has not been populated yet. Typically, presentation templates evolve more rapidly than content models. After all, a change in a content model usually involves some manual content entry. In my scenario of the university, there is a benefit of centralizing the ownership of the content model. This allows content sharing across sites: if one department defines a news item differently than another department, it is difficult to have a combined news feed. Centralizing the content model will further slow its evolution because there needs to be alignment between the different departments.
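
As a rough illustration of a template that fails gracefully, here is a small sketch in plain Python rather than any real templating language. The ContentItem class, the render_news_item function, and the “subtitle” attribute are all made up for the example.

```python
from html import escape


class ContentItem:
    """Hypothetical content item: a type name plus whatever attributes are populated."""

    def __init__(self, content_type, **attributes):
        self.content_type = content_type
        self.attributes = attributes

    def get(self, name, default=""):
        """Return an attribute, or a default if it is missing or not yet populated."""
        value = self.attributes.get(name)
        return default if value in (None, "") else value


def render_news_item(item):
    """Render a news item; 'subtitle' stands in for a newly added, optional attribute."""
    lines = [f"<h1>{escape(item.get('title', 'Untitled'))}</h1>"]
    subtitle = item.get("subtitle")
    if subtitle:
        # Older items have no subtitle: skip the element instead of failing.
        lines.append(f"<h2>{escape(subtitle)}</h2>")
    lines.append(f"<p>{escape(item.get('body'))}</p>")
    return "\n".join(lines)
```

With this shape, the new template can be deployed before contributors have populated the new attribute, which loosens the timing between template deployments and content updates.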

Wow, two geeky posts in a row. I promise the next one will be less technical.

CMS Architecture: Managing Content Type Configurations

Tuesday, January 19th, 2010

Warning: this post is highly technical. Non-programmers, please avert your eyes.

Deane Barker (from Blend Interactive) and I have a running conversation about CMS architectures. One of the recurring topics is how content models and other configuration are managed. There are two high-level approaches: inside the repository and outside the repository. Both have their advantages and disadvantages.

  • Managing content types outside the repository

    My preferred approach is to manage content type definitions in files that can be maintained in a source code management system. This way you can replicate a content type definition to different environments without moving the content. Developers can keep up to date with changes made by their colleagues. Configuration can be tested on Development and QA before moving to production. There is no user-interface to get in the way. No repetitive configuration tasks. Everything is scriptable and can be automated. I especially like it when content types are actual code classes so you can add helper methods in addition to traditional fields. Of course, when you get into this, it is a slippery slope into a tightly coupled display tier that can execute that logic.

    On the downside, it is often difficult to de-couple the content (which sits in the repository) from the content model (which defines the repository). When you push an updated content type to a site instance, you might need to change how the content is stored in the repository. This is more problematic in repositories that store content attributes as columns in a database. It is less of a problem in repositories that use XML or object databases (or name-value pairs in a relational database) where content from two different versions of the same model can coexist more easily.

    If you do manage content type definitions outside of the repository, a good pattern to follow is data migrations, which was made popular by Ruby on Rails. I am using a similar migration framework for Django called South. Basically, each migration is a little program that has two methods, forward and back (“up” and “down” in RoR; “forwards” and “backwards” in South), that can add, remove, and alter columns and also move data around. The forward method updates the database; the backward method reverts to the earlier version.

  • Managing content types within the repository

    Most CMSs follow the approach of managing the content type definitions inside the repository and provide an administrative interface to create and edit content types. This is really convenient when you have one instance of the application and you want to do something like add a new field. There is no syntax to know and application validation can stop you from doing anything stupid. Some CMSs allow you to version content type definitions so that you can revert an upgrade.

    When you have multiple instances of your site, managing content types can be tedious and error prone if you need to go through the administrative interface of each instance and repeat your work. Of course, you can’t copy the entire repository from one instance unless you want to overwrite your content. If your CMS is designed in this way, you should look for a packaging system that allows you to export a content definition (and other configurations) so that it can be deployed to another instance. Many CMSs allow an instance to push a package directly over to another instance. The packaging system may also do some data manipulation (like setting a default value for a required new field).
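
Both approaches end up needing the same kind of step: a small, reversible change to stored content when a content type changes. Here is a minimal sketch of the forward/backward migration idea using Python’s built-in sqlite3 module. It is not the Rails or South API; the AddSummaryToArticle class, the article table, and the summary column are invented for illustration.

```python
import sqlite3


class AddSummaryToArticle:
    """Hypothetical migration: adds a 'summary' column to an 'article' table."""

    def forwards(self, db):
        # A required new field needs a default so existing rows stay valid.
        db.execute("ALTER TABLE article ADD COLUMN summary TEXT NOT NULL DEFAULT ''")
        # Move data around: seed the new field from the first 40 characters of the body.
        db.execute("UPDATE article SET summary = substr(body, 1, 40)")

    def backwards(self, db):
        # Rebuild the table without the column, which works on any SQLite version.
        db.execute("CREATE TABLE article_old (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
        db.execute("INSERT INTO article_old SELECT id, title, body FROM article")
        db.execute("DROP TABLE article")
        db.execute("ALTER TABLE article_old RENAME TO article")


def column_names(db, table):
    """List the column names of a table using SQLite's table_info pragma."""
    return [row[1] for row in db.execute(f"PRAGMA table_info({table})")]
```

Running forwards against a database adds and populates the column; running backwards restores the earlier schema while keeping the original titles and bodies.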

Unless you are building your own custom CMS, this all may seem like an academic question. It really is quite philosophical: is configuration content that is managed inside the application or does it need to be managed as part of the application? The same thing goes for presentation templates (but that is another blog post). However, if you intend to select a CMS (like most people should), it is important to understand the choice that the CMS developers made and how they work around the limitations of their choice. If you are watching a demo, and you see the sales engineer smartly adding fields through a UI, you should ask if this is the only way to update the content model and if you can push a content type definition from one instance to another. If the sales engineer is working in a code editor, you need to ask how the content is updated when a model update is deployed.

Daniel Jacobson on de-coupled publishing systems

Tuesday, October 13th, 2009

Daniel Jacobson, NPR’s Director of Application Development, has an excellent article on the philosophy of de-coupling the content management tier from the delivery tier. He calls this strategy COPE: Create Once Publish Everywhere. The diagram is particularly useful in showing how all the pieces fit together.

If you are in the content publishing business (like NPR is), this is absolutely how you need to think about your content technology stack. Your content repository, editorial systems, and distribution channels can get sophisticated and highly specialized so compromise and lock-in can be costly. The delivery tier that came with your content management system may not scale or may not allow you to push the cutting edge if an opportunity to innovate should arise. All of my big name publishing clients have adopted this strategy for their core publishing platform.

However, as I have warned in earlier posts, the flexibility may not be worth the cost for all publishers. Unless your business model depends on aggressively leveraging your content and you can afford to play on the cutting edge, a lighter weight “website in a box” style architecture may give you the flexibility you need without the additional complexity and cost of building and integrating these de-coupled systems. As an example, Drupal is rapidly becoming a popular platform for small to medium publishers and also for smaller initiatives in larger publishers. And you cannot get a tighter coupling of management and delivery tier than Drupal. One strategy that has been used by early Drupal adopters (that have grown out of their forked versions of the platforms) is to use Drupal to publish into a custom delivery tier.

As an architect, I love the COPE model and I think that most successful, large scale content businesses will eventually converge on that strategy. However, as a pragmatist who also serves publishers in the lower and middle tiers of the industry, I know that the resources and expertise may not be available to go straight to that architecture. Still, at the very least, every publisher needs to start thinking on this level: creating and storing content in a presentation-neutral way to keep options open.

The CMS Decorator Pattern

Wednesday, July 22nd, 2009

Web content management systems are very good at capturing, managing, and rendering semi-structured content. They give the contributor tools for controlling the organization of a web site and the layout and composition of pages. However, when it comes to strictly relational, tabular data, all those features like versioning and preview tend to get in the way. You wouldn’t build a customer relationship management (CRM), enterprise resource planning (ERP), or accounting system on top of a content repository designed to stage and deploy content. There are better tools for that.

A common pattern when you want to present highly structured, relational data on a website alongside managed content is to manage those data outside the CMS and then use the CMS to organize and augment them. I have seen this pattern used for years but it didn’t occur to me until Will Ezell (from dotCMS) mentioned that this was a direct application of the Decorator Pattern. Until that point, I had mainly used the decorator pattern at the object oriented class level. Since then, I have been using the phrase “use the CMS to decorate your _______ data.” This description seems to resonate with people even if they are not familiar with the book Design Patterns: Elements of Reusable Object-Oriented Software.

Here is a concrete example. Let’s say that your ERP system is the system of record for your core product data: price, dimensions, materials, manufacturer, weight, distributor, availability, sku and size options. But these data are dry and not at all compelling to the consumer browsing an e-commerce site. You can use the CMS to add additional information like a description and photos of the product. You can also use the CMS to control where the product appears on the website: within a collection of promoted items on the home page; as a featured product on a department page; as part of an email newsletter. When designing the shopping experience of your customers, the features of a web content management system really come in handy. You can preview and stage the pages so you can see what they will look like. You can use scheduled publishing to make the pages go live on a specific date.

From an architectural perspective, the integration does not have to be that complex. Essentially the content items that decorate your catalog data just have to be aware of a primary key that is managed in your ERP system. Most web content management systems allow you to configure a content editing form to use a database to populate a dropdown box or a more complex browsing interface. So a user might go into the CMS, create a new “Product” asset and select a Product ID from a dropdown list and then add content to “decorate” that product. Most CMS rendering engines can read from external data sources. Once you have that shared key, the assembling of content and data happens at rendering time.
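
To show how little is involved, here is a minimal sketch of the join at rendering time. The dictionaries stand in for the ERP system and the CMS repository; the product ID, the field names, and the render_product function are hypothetical.

```python
# Stand-in for the ERP system of record, keyed by product ID (hypothetical data).
ERP_PRODUCTS = {
    "SKU-1001": {"price": 49.00, "weight_kg": 1.2, "availability": "in stock"},
}

# Stand-in for CMS content items: each stores only the shared key plus editorial content.
CMS_PRODUCT_CONTENT = [
    {"product_id": "SKU-1001", "description": "A sturdy desk lamp.", "photos": ["lamp.jpg"]},
]


def render_product(content_item, erp_lookup):
    """Merge editorial content with ERP data at render time using the shared key."""
    record = erp_lookup.get(content_item["product_id"])
    if record is None:
        return None  # the ERP does not know this product, so render nothing
    merged = dict(content_item)  # start with the CMS-managed "decoration"
    merged.update(record)        # ERP fields win on any name collision
    return merged
```

Letting the ERP fields win on a collision is a design choice in this sketch: it keeps the system of record authoritative for anything it owns, and the CMS only adds to it.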

This is one of those approaches that is so obvious that it is frequently ignored and overlooked. Once you are aware of this pattern and keep it in mind as an option, it becomes much less tempting to overload your CMS implementation to manage data that it was not designed to manage. The user interfaces can be optimized for specific purposes and your architecture becomes cleaner. When a content type gets complicated to manage, you should always ask yourself “am I using the right system to manage this data?”

Code moves forward. Content moves backward.

Wednesday, July 1st, 2009

One of the primary functions of a web content management system is separating content from layout. Authors create semi-structured content in a display-neutral format and then the presentation templates transform that content to web pages for regular browsers, mobile browsers, RSS feeds, email, and print. As most readers of this blog know, this separation introduces big efficiencies in re-use: content is managed in one place and appears in many places and in many formats. But this magic does not come for free. Someone has to build presentation templates that render the content and that person needs to have developer skills. The template developer needs to know HTML and the special templating syntax to retrieve and format content from the repository. The developer also needs to know how to test software for all the different conditions that it will encounter: different browsers; content with extreme values in any of its elements (like a really long title or a missing summary); and even high traffic load. Because one piece of code can potentially affect every page on the site, testing is important and it needs to be done in a safe place. The same thing goes for all sorts of other software configurations.

Essentially, the role of the static webmaster has been broken into two: the content contributors (that no longer need to be technical) and the developer (who needs to be more skilled than your average DreamWeaver jockey). It also breaks up the management of the site into two lifecycles: the content lifecycle and the code/configuration lifecycle. The production instance of the CMS itself is designed to support the content lifecycle. CMSs have workflow functionality to manage the state of an asset from draft through published and to archived. They have preview functionality so that only contributors can see content that has not been published. Some CMSs are designed so that the development lifecycle can also occur in the production instance. This is usually done by creating workspaces or sandboxes, essentially treating code like another category of content. To be sure, you still want to have a QA instance of the system so you can test software upgrades (of the core) before applying them to your live site. In most CMSs, however, developers work on separate environments (either individual developer environments and a code staging area or a shared development environment) and not the live, production instance of the CMS.

While the content and the software lifecycles are de-coupled, they are interdependent. Developers need realistic content to develop and test code. The content relies on the code to define and display itself. There are lots of situations where these aspects get tangled. For example, when a new field is added to a content type it needs to be populated (sometimes manually) and the presentation templates need to be modified to display it. There are also cases where the line between content and code starts to blur: contributors style their content with CSS classes that are defined in style sheets; contributors can embed a tag that calls some additional display logic (like inserting a re-usable display component); contributors can build web forms that need to be submitted to code that does something with the information.

The standard approach for managing these interdependencies is what I call “code forward, content backward.” Code and configuration are developed in a development environment and tested in a QA environment. When they are ready, they are deployed to production. Content is developed and previewed in the production instance that contains the staging (or preview) and live content states. Periodically, content should be published backwards to the development and QA instances so that testing can be as realistic as possible. In cases where the code/configuration and content are so tightly coupled (like when you need to break one field into two), you may need to export production content to the QA instance where you do some content transformation and test it with the newest code and then push that content and code back to production at the same time. When you do this, just make sure that you don’t have anyone adding content to production because it will be overwritten or (worse) cause some kind of corruption.
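
The “break one field into two” case is easier to picture with a small example. This is only a sketch of the transformation step that would run on the QA instance against exported production content; the export format (a list of dictionaries) and the author field names are assumptions for illustration.

```python
def split_author_field(item):
    """Return a copy of a content item with 'author' split into first and last name."""
    transformed = {key: value for key, value in item.items() if key != "author"}
    # Split on the first space; a single-word name leaves the last name empty.
    first, _, last = item.get("author", "").strip().partition(" ")
    transformed["author_first_name"] = first
    transformed["author_last_name"] = last
    return transformed


def transform_export(exported_items):
    """Apply the transformation to a whole export without modifying the original."""
    return [split_author_field(item) for item in exported_items]
```

Because the original export is left untouched, the transformation can be re-run against a fresh export right before the combined content and code push, which shortens the window in which production has to be frozen.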

Different CMSs handle the migration of code and content in different ways. Some provide nice graphical utilities to export configurations from one instance of the application to another. In other products, there are ways to manually transfer the settings as a collection of files. Some products don’t support this at all so you have to manually repeat the same steps on each environment you are managing. When you are evaluating CMS products, keep these requirements in mind. Otherwise, you will be in for a surprise as you near the launch date of your new website and you need to fix bugs on live content.

Great presentation on content modeling

Tuesday, June 23rd, 2009

Deane Barker, over at Gadgetopia, has posted slides from his presentation “Just put that in the zip code field”. He gave the talk at the Web Content 2009 conference in Chicago. Unfortunately, I was not able to attend the conference and missed seeing Deane present. However, knowing that I am as passionate about this stuff as he is, Deane and I did talk at great length on content modeling during the days leading up to the conference. Oh, the war stories we told. Those conversations inspired me to write this post on pages and objects.

The reason why I find this topic so important (aside from the fact that I am a recovering DBA myself) is that content modeling capability is one of those difficult to change characteristics of a content management system. It is what I call a “load bearing wall” in the customization of a CMS. That is, while it may be possible to remediate a content modeling limitation, all the buttressing required may make such an effort impractical. Content modeling architecture is so difficult to change, in fact, that the products themselves tend to live with what they have and change very little in this area. Products that do change how they model content usually take a while to stabilize as they work out the nuances of how to generate entry forms and validation routines and the appropriate templating syntax to access the elements.

Because of all this, content modeling is a critical part of my CMS selection process. Part of my demo process requires the suppliers to implement a content model specification that is based on the client’s own content. Deane’s presentation also gives useful tips on what to look for in a CMS. In particular, I look for the ability to support specific data types and structures. Don’t know what that means? Then take a few minutes and click through Deane’s presentation. Or, better yet, look for an opportunity to see Deane present it live. You might see me there too.

Pages and Objects

Wednesday, June 17th, 2009

Back in prehistoric times, I was implementing a custom CMS for a very large computer manufacturer. The data model drew a distinction between “pages” and “objects” and I remember having a difficult time understanding (and then, later, explaining) the difference. At a high level, objects were items of content (like a “product”) and pages were containers (like a landing page that lists collections of objects). The areas where my simplistic explanation tended to break down were 1) the notion of a detail page that just displayed one object and 2) unmanaged listing pages that just automatically listed objects. These were the cases where you would have pages on the site that do not map to “pages” in the repository. If you were to practice the page/object model in its truest form, you would create a page asset to wrap objects for every page on the site. This didn’t make sense when you had a product catalog with thousands of items (objects) in it. Over time, the page content type became less and less used until it was defined strictly as a tool for building landing (also known as category) pages. That made sense because the site was really about the products and displaying them in lots of different ways. But if this had been a project to build a typical brochure site (where contributors focused on managing things like the “about” page or the “services” page) we may have gone in the other direction and focused on pages. Objects would have diminished to components that could be re-used across pages.

Another way to look at it is to ask who owns the pages, the contributor or the display tier? In a page based model the user owns the page. He places the page in a site hierarchy, gives it a URL, and then fills it with content. In the object based model, the contributor feeds in the content and the display tier (the controllers and the views) has the logic to decide what content to show and how to show it. Like I mentioned in my computer manufacturer website story, the object based approach tends to do better with sites that have more content than is practical to manually organize onto pages. In an extreme example, think about if www.google.com was a website that someone had to manage. No matter how many editors they hired, it would be impossible for an editorial staff to manage every single result page. Maybe they could go in and fix the descriptions of a few index entries here and there.

When I look at web content management systems today, I see similar stories to my prehistoric custom web content management system experience. Every web content management system on the market today grew out of some project to build a website and then was abstracted to build more websites. A WCMS is either conceived as a product and then heavily shaped by its earliest customers or it starts life as an in-house project and then is abstracted into something that could be resold. Those initial uses leave their imprint and become part of the product’s DNA. That doesn’t mean that products are necessarily limited in their use. As it matures, a product runs into a diverse range of potential customers that forces it to broaden its capabilities. In the real world, of course, web sites are a combination of pages and objects and the contributor needs at least some level of control of both. However, this digital DNA does help determine what problems it solves more naturally (or comfortably, or intuitively) than others. Page based systems need to figure out a clean way to manage “placeless” content and object based systems need to figure out a simple way to manage basic pages. Some of these additions feel more awkward than others.

The only way to really appreciate the difference in approach and its implications for you is by demonstrating the products managing a site like yours with content like yours. I like the scenario approach where you (or the vendor) build a prototype using your content types and then test it with your typical usage scenarios. It is only then that you will see how well it addresses your balance of objects and pages.

Gilbane slides posted

Wednesday, June 3rd, 2009

Yesterday morning I presented a three hour workshop on selecting a web content management system at the Gilbane Conference in San Francisco. I was told to expect a small audience but the room filled up quite nicely. I ran out of hand-outs but you can download a copy here. Here are my slides.

Today I will be presenting in a session on WCM Architectures and Customization. My talk is about architectural patterns in web content management systems and strategies for extending them. Here are the slides.

Different Storage Models for Content

Tuesday, April 7th, 2009

Joel Amoussou has a great article explaining the benefits and implications of storing your content in a relational database vs. an XML store. After making the case for when to consider XML over the more common RDBMS/ORM/POJO/Template approach, Joel provides some tips for content modeling and makes some great points about how you need to think a little differently when you work with XML.

I would like to reinforce Joel’s comment that the XML stack is quite different than technologies that you or your developers may be used to. The learning curve can be quite steep and many developers just give up before they really get it. Transitioning to an XML based architecture may not pay off for content management applications where your content types consist of a number of structured fields (like title and author) and one or more unstructured elements (like description and body) that the CMS just reads out what the author typed in – in other words, like this blog.