A participant on the iECM mailing list recently posted a link to an article about the GSA concluding that metadata is not essential. The study, based largely on information from industry experts, found that search technology is good enough that full text indexing is sufficient and no manual human intervention is necessary.
I am sure that my taxonomist friends are working up a worthy rebuttal. But let's just consider the proposition that, at least in the case of normal textual information like this blog entry, manual keyword assignment is not essential. Of course, as the article states, this does not apply to graphical content or numerical data, which cannot be parsed into words that would match a textual search query. But in the case of text, is it reasonable to assume that the author will, in the course of writing, wind up using words that a prospective searcher will search for? There is the issue of synonyms, word choice, and word stems, but that can be accounted for in a good search algorithm (when the query contains “blog,” also look for “web log” and “journal”; when the query is “running,” look for “run”). Google seems to do a good job.
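To make the synonym and stemming idea concrete, here is a minimal sketch of query expansion. The synonym table, the `naive_stem` helper, and `expand_query` are all hypothetical names of my own, and the stemmer is deliberately crude; a real search engine would use a proper stemming algorithm and a much larger thesaurus.

```python
# Hypothetical synonym table, for illustration only.
SYNONYMS = {"blog": ["web log", "journal"]}


def naive_stem(word):
    """Strip a couple of common suffixes. Crude on purpose, not a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant: "runn" -> "run".
            if suffix == "ing" and len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word


def expand_query(query):
    """Return the query words plus their stems and any known synonyms."""
    terms = set()
    for word in query.lower().split():
        terms.add(word)
        terms.add(naive_stem(word))
        terms.update(SYNONYMS.get(word, []))
    return sorted(terms)


print(expand_query("blog"))
print(expand_query("running"))
```

The search engine would then match documents against any of the expanded terms rather than only the words the searcher happened to type.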
Interestingly, the commercial search engines all ignore keyword tagging because it is so often abused. I am reminded of the Extreme Programming philosophy about commenting your code (at least at the method level). The code itself should be clear about what it does without explanation by comments. The need for comments is a symptom of overly complex and, therefore, hard to maintain code. I re-read this post and it contains every keyword that I would have used to classify it (search, query, metadata, GSA).
To be honest, most clients I have worked with have either neglected or abused keywords. Either they don’t understand the value of keywords and don’t bother, or they try to game the system so that their content gets put in visible places (yes, this even happens on corporate intranets).
So what if we said to our authors what the commercial search engines tell us: “Don’t worry about meta-data tagging, just write good content and we will bring you the right readers.” Where would we be?
But metadata is not just keywords. Look at the basic Library of Congress search page. See how you can search on different metadata fields to get what you want? Metadata also helps content reuse. For example, if the title, summary, author, and other attributes of content are stored in a structured way, they can be shown on pages that list many content assets, not just the detail page. A 50-word summary is more valuable than the first 50 words of a 10,000-word document (unless the author is especially good at getting to his point. I noticed that in this entry I led in by talking about iECM, which this post has nothing to do with). Structuring a portion of content also helps with things like sorting (by date, author, etc.).
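As a rough illustration of the reuse and sorting point, here is a small sketch of content stored with structured fields. The `ContentItem` class and `listing` function are hypothetical and not taken from any particular CMS; they only show how separate title, author, date, and summary fields let a listing page be built and sorted without touching the body text.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ContentItem:
    # Each attribute lives in its own field instead of one rich text blob.
    title: str
    author: str
    published: date
    summary: str
    body: str


def listing(items, sort_key="published"):
    """Build one line per item for a listing page, sorted by a metadata field."""
    ordered = sorted(items, key=lambda item: getattr(item, sort_key))
    return [
        f"{item.title} ({item.author}, {item.published}): {item.summary}"
        for item in ordered
    ]
```

If the same article were pasted into a single rich text field, neither the summary line nor the sort would be possible without parsing the body.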
Metadata is what content management is. To quote a recent CMS Watch article by MarkLogic CEO Dave Kellogg:
That is, while ECM tracks and manages a lot of information about the content, it actually does relatively little to help get inside content. Despite its middle name, ECM today isn’t really about content. It’s about metadata.
Without metadata, an ECM is just a file system with versioning.
So, it looks like authors are not off the hook. Interestingly, in the library world, the people who write the metadata are different from the people who write the content. Unfortunately, this is too costly for most corporate environments that casually create and use content and don’t have the budget for a full-time librarian staff.


BTW, I noticed that the article that I link to is hosted on a Plone site, if you wanted to know what Plone looks like out of the box.
You are right that most clients ignore or neglect entering metadata. In fact, I’ve seen that many of them neglect other best practices too. For example, they prefer not to enter the headline, byline, author, and other information in separate fields. They would rather use a rich text editor and paste the whole damn article in one field!!
So even though there is a mandate at the organization level that content should be reusable and re-purposable it does not percolate to actual content guys. So when a content creator creates content, all they want is to finish their work. Who cares for best practices!
I guess human beings are by nature lazy.
I think the only way to prevent this is to train, train and train and make sure the content creators know the importance of this.
Thanks for sharing your insights.
You say that “Without metadata, an ECM is just a file system with versioning.” I would go further and claim that version numbering is part of the metadata for a piece of content.
Looking at it this way gels nicely with Apoorv’s comment above. Some metadata, such as version number, can automatically be inferred from the context. Some metadata can be inferred from the content if you apply natural language tools and rules.
But this probably doesn’t cut it fully. More value can be added by applying more conceptual tags, and that remains, for now, a manual task. A task that goes against human nature. A lot more thought needs to go into this arena, e.g. whether metadata can be applied from the context in which a piece of content is consumed, not just the context in which it is created.