Wednesday, May 21, 2008

Content is not Data

David Nuescheler, CTO of Day Software and spec lead for the Java Content Repository specifications JSR 170 and 283, likes to say "everything is content." This is a bold statement that is intended to provoke thought, but I think that it is also a reaction to a prevailing view among technologists and database vendors that everything (including content) is just data. While it is true that content, when stored electronically, is just a bunch of 0's and 1's, if you think that content is just data, you need to get out of the server room because that is not how your users see it. There are four main reasons why.

  1. Content has a voice. Put another way, content is trying to communicate something. Like me writing this blog post, a content author tries to express an idea, make a point, or convince someone of something. Communication is hard and requires a creative process so authoring content takes much more time than recording data. Content is personal. If the author is writing on behalf of a company, there may need to be approvals to ensure the voice and opinion of the company is being represented. The author may refer to raw data to support his point, but he is interpreting. For example, even a graph of data may reflect some decisions about what data to include and how to show them. Because content has a voice, content is subjective. We consider the authority and perspective of the author when we decide whether we can trust it.

  2. Content has ownership. Data usually do not have a copyright but content does. The people who produce content, like reports, movies, and music, get understandably annoyed when people copy and redistribute their work. While data can be licensed, it is less common. Often data are distributed widely so that more people can provide insight into what they mean. Interestingly, when content is digitally stored as data on a disk, we think less about it in terms of content. For example, we are OK with data backups of copyrighted material even though creating copies is forbidden.

  3. Content is intended for a human audience. While content management purists strive for a total separation of content and presentation, content authors care about how content is being presented. Some have a lot of control over presentation and obsess over every line wrap; others only get to choose which words are bolded or italicized. They will only semantically tag phrases in a document if they know that it will make for a richer experience for the audience. Presentation is not just for vanity's sake. Presentation, when done well, helps the audience understand the content by giving cues as to how things are organized and what is important. While the Semantic Web is all about machines understanding web content, at the end of the day, the machines are just agents trying to find useful information for human eyeballs (and eardrums). Content is authored with the audience in mind while data are just recorded.

  4. Content has context. In addition to who wrote the content, where it appears also matters. We care greatly how content is classified and organized because we want to make it easier to find. A database table doesn't care about the order of its rows (it is up to the application to determine how they should be sorted). Content contributors really care about where their assets fall in lists (everything from index pages to search results).

These distinctions may seem totally academic but I think they have real implications for the technologies that manage content. Because content is much more than "unstructured data," we can't think about the tools we use to manage and store it just in terms of big text fields in a relational database and forms to update these rows. Content is a personal experience for both the author and the audience, and the technology that sits between them needs to be sensitive to that. Every once in a while there is a meme about "content management" becoming an irrelevant term because it will be subsumed into other, more process- or industry-oriented disciplines. If that does happen, it is critical that certain content technology features and concepts carry over.

  1. Versioning. Content goes through a life cycle of evolution and refinement as groups of contributors work together to achieve the best way to convey the information and ideas. Some content assets (like policies and procedures) are updated hundreds of times over many years as information changes. Other assets go through many rapid iterations over a shorter period of time (such as an intensely negotiated contract). Often participants in a content life cycle need to know just what has changed. For example, a copyeditor can save time by just proofreading the changes since the previous copy edit. A translator may not need to re-translate an asset if only a minor edit was made. Sometimes the history of change can give insight into the spirit of meaning. Versioning is not just for reverting to older versions. A robust versioning system has features like version comparison and annotations.

  2. Control over the delivery. To effectively communicate, you need to tune your delivery to your audience. WYSIWYG editing and preview both try to give a content contributor the perspective of their audience. WYSIWYG editing gives a non-technical contributor some control over the styling of text. It is important that the WYSIWYG editor gives an accurate representation (as in the same CSS styles) of what a visitor will see. Single page preview puts the content into the context of a page by executing rendering logic. The more complex the rendering logic, the more difficult it is to control what the user sees. For example, if there is some logic to automatically display relevant related content, the preview environment has to have the same content, rendering code, and user session information as the production environment. Oftentimes, this is hard to do. I have had clients really struggle over controlling dynamic rendering logic. For example, a relevance engine automatically associated inappropriate images with articles or showed the same related content multiple times. Some users also like to see how articles show up on dynamic indices and search results. In these complex delivery tiers, preview is a lot more like software QA than simple visual verification - you need to test all the scenarios and parameters. A good practice is to delineate the pages or sections that you want full editorial control over and the other (less important) sections that are not worth the manual effort of controlling.

  3. Feedback. You can't communicate in a vacuum. You need feedback. However, most content contributors lob their content over the wall and then forget about it. When you are speaking in front of a group you can gauge reaction and make adjustments. As the web turns into a conversation, the content contributor needs to be listening as much as they are telling. Most content contributors underuse web analytics. The more accessible this information can be made, the better. Many web content management systems integrate analytics packages and have nice features like analytics overlays over rendered pages. However, these features are not used enough. More commonly, an analytics report will be circulated to people who don't understand how to read it. Comments and voting can also be a powerful channel for feedback, which the contributor can react to either by direct response or by using knowledge of the audience in subsequent articles.

  4. Metadata. While metadata storage is trivial, capturing and using this information is a challenge. Metadata such as source and ownership are critical to tell the audience where the asset comes from (its voice and authority) and how it can be legally used. Metadata is also important for classification and context. Content contributors are notoriously bad at metadata entry: they either neglect or abuse it. Automation is part of the solution, but a good process involves humans with the responsibility for metadata (bring on the librarians!). The best way to leverage and exchange metadata is through standards-based formats. Industry-oriented formats (like NITF) are important because they have a standard set of metadata built in. Microformats are also useful for highlighting specific bits of standard information within rendered web pages. While most WCM platforms can produce these outputs through their templating tier, very few do any validation of the output. Reviewers just visually validate what they see on a preview page.

  5. Usability. Most of all, the system needs to be easy to use. Creating content is hard work no matter how you do it. Any system that distracts a user from the creative process of developing content, or complicates it, is bound to be unpopular and the first excuse for failure. The ideal content management system disappears from the user's consciousness by being familiar and frictionless - you don't need to think about it and it gives you immediate results. For many people, that is Microsoft Word (until Word tries to outsmart you and take over your document) and I have already mentioned the disturbing amount of web content that originates in MS Word. For some, blogging tools are approaching this level of usability. For others, in-context editing achieves it. In many cases users get so familiar with a tool that they forget they are using it even if the tool is hard to learn at first (I am reminded of this when my fingers just automatically type the right commands in vi). This usually only happens when you have specialists operating the CMS rather than a distributed authoring model where all the contributors enter their own content.
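To make the versioning point a little more concrete, here is a minimal sketch of the "just what has changed" idea: given two saved versions of an asset, show a copyeditor or translator only the lines that differ. It uses Python's standard difflib module. The function name, the sample policy text, and the idea of comparing line by line are all my own assumptions for illustration; a real repository would have its own version store and would probably compare at a finer grain than whole lines.

```python
import difflib


def changed_lines(old_text: str, new_text: str) -> list[str]:
    """Return only the lines that were removed ("- ") or added ("+ ").

    Hypothetical helper for illustration: unchanged lines and difflib's
    "? " hint lines are dropped, so a reviewer sees only the edits.
    """
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    # Two made-up versions of a policy document.
    previous = (
        "Employees accrue 10 vacation days per year.\n"
        "Requests go to your manager."
    )
    current = (
        "Employees accrue 15 vacation days per year.\n"
        "Requests go to your manager."
    )
    for line in changed_lines(previous, current):
        print(line)
```

In this example only the first sentence would be flagged, so the second sentence never needs to go back to the copyeditor or the translator. Annotations (who made the change and why) would live alongside each version; they are left out here to keep the sketch short.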

If you are building an application that also needs to manage content, don't just think of the content in terms of CRUD for semi-structured data. Luckily, components and frameworks are available to incorporate into your architecture. The Open Source Web Content Management in Java report covers Alfresco, Hippo, and Jahia from this perspective. Recently, I have been playing around with the JCR Cup distribution of Day's CRX that bundles Apache Sling (very cool!). Commercial, back-end-focused products like Percussion Rhythmyx and Refresh Software SR2 certainly play in this area. People used to deploy Interwoven TeamSite for this but I think it is too expensive to be used in this way. Bricolage is an open source, back-end-only WCM product written in Perl. But accurate preview and content staging can be complicated in decoupled architectures. Drupal and Plone are also quite popular as content-centric frameworks for building applications but they tend to dominate the overall architecture (unless you use Plone with Enfold Entransit).

You have plenty of options that will allow you to avoid brewing your own content management functionality. Consider them!