Monday, January 9, 2006

GSA: Meta-data not essential for search

A participant on the iECM mailing list recently posted a link to an article about the GSA concluding that metadata is not essential for search. The study, based largely on information from industry experts, found that search technology is good enough that full-text indexing is sufficient and no manual human intervention is necessary.

I am sure that my taxonomist friends are working up a worthy rebuttal. But let's just consider the proposition that, at least in the case of normal textual information like this blog entry, manual keyword assignment is not essential. Of course, as the article states, this does not apply to graphical content or numerical data, which cannot be parsed into words that would match a textual search query. But in the case of text, is it reasonable to assume that the author will, in the course of writing, wind up using words that a prospective searcher will search for? There is the issue of synonyms, word choice, and word stems, but a good search algorithm can account for that (when the query contains "blog," also look for "web log" and "journal"; when the query is "running," look for "run"). Google seems to do a good job.
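To make the synonym and stemming idea concrete, here is a minimal sketch in Python of how a query might be expanded before it is matched against a full-text index. The synonym table, the `naive_stem` helper, and the `expand_query` function are all made up for illustration; a real search engine would use a far larger thesaurus and a proper stemming algorithm, and I am not claiming this is how Google or any other engine does it.

```python
# Hypothetical synonym table; a real engine would use a much larger thesaurus.
SYNONYMS = {
    "blog": ["web log", "journal"],
}


def naive_stem(word: str) -> str:
    """Very rough stand-in for a real stemmer: strips a trailing "ing"."""
    word = word.lower()
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        # Undo consonant doubling, e.g. "runn" -> "run".
        # This is deliberately crude and will mangle words like "falling".
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    return word


def expand_query(query: str) -> set:
    """Return the original query words plus their stems and synonyms."""
    terms = set()
    for word in query.lower().split():
        terms.add(word)
        terms.add(naive_stem(word))
        terms.update(SYNONYMS.get(word, []))
    return terms


print(sorted(expand_query("blog")))
print(sorted(expand_query("running")))
```

The point is only that the expansion happens on the search side, at query time, so the author never has to anticipate which of several words the reader will type.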

Interestingly, the commercial search engines all ignore keyword tagging because it is so often abused. I am reminded of the Extreme Programming philosophy about commenting your code (at least at the method level). The code itself should be clear about what it does without explanation by comments. The need for comments is a symptom of overly complex and, therefore, hard to maintain code. I re-read this post and it contains every keyword that I would have used to classify it (search, query, metadata, GSA).

To be honest, most clients I have worked with have either neglected or abused keywords. Either they don't understand the value of keywords and don't bother, or they try to game the system so that their content gets put in visible places (yes, this even happens on corporate intranets).

So what if we said to our authors what the commercial search engines tell us: "Don't worry about meta-data tagging, just write good content and we will bring you the right readers." Where would we be?

But metadata is not just keywords. Look at the basic Library of Congress search page. See how you can search on different metadata fields to get what you want? Metadata also helps content reuse. For example, if the title, summary, author, and other attributes of content are stored in a structured way, they can be shown on pages that list many content assets, not just the detail page. A 50-word summary is more valuable than the first 50 words of a 10,000-word document (unless the author is especially good at getting to his point; I noticed that in this entry I led in by talking about iECM, which this post has nothing to do with). Structuring a portion of content also helps with things like sorting (by date, author, and so on).
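Here is a rough sketch of what storing those attributes in a structured way buys you. The `ContentAsset` record, its field names, and the sample entries are assumptions made up for this example (only the title and date of this post come from the entry itself); no particular CMS models content exactly this way.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ContentAsset:
    # Structured metadata, kept separate from the body text.
    title: str
    author: str
    published: date
    summary: str
    body: str


def listing_line(asset: ContentAsset) -> str:
    """One-line entry for a page that lists many assets."""
    return f"{asset.title} ({asset.author}, {asset.published.isoformat()}): {asset.summary}"


def newest_first(assets):
    """Sort by the date field, something full text alone cannot do reliably."""
    return sorted(assets, key=lambda a: a.published, reverse=True)


def find_by(assets, field: str, value):
    """Fielded search: match on a single metadata attribute."""
    return [a for a in assets if getattr(a, field) == value]


# Made-up sample data for illustration.
assets = [
    ContentAsset("Intranet style guide", "A. Writer", date(2005, 11, 2),
                 "How to write for the intranet.", "..."),
    ContentAsset("GSA: Meta-data not essential for search", "B. Blogger", date(2006, 1, 9),
                 "Is full-text search enough without keywords?", "..."),
]

for asset in newest_first(assets):
    print(listing_line(asset))
```

None of this requires keyword tagging. The listing, the sort, and the fielded lookup all rely on attributes like title, author, date, and summary being stored as separate fields.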

Metadata is what content management is. To quote a recent CMS Watch article by MarkLogic CEO Dave Kellogg:

That is, while ECM tracks and manages a lot of information about the content, it actually does relatively little to help get inside content. Despite its middle name, ECM today isn't really about content. It's about metadata.
Without metadata, an ECM is just a file system with versioning.

So, it looks like authors are not off the hook. Interestingly, in the library world, the people who write the metadata are different from the people who write the content. Unfortunately, this is too costly for most corporate environments, which casually create and use content and don't have the budget for a full-time librarian staff.