Content Here

Where content meets technology

Apr 12, 2012

Amazon CloudSearch

I just got an email announcement for Amazon's new CloudSearch service. This could be really cool if popular applications and frameworks like Wordpress, Drupal, Rails, and Django build modules/apps/extensions that interface with CloudSearch. It would be nice to have an alternative to Google Custom Search (which has gotten pretty expensive for small sites) or (in the case of Drupal) Acquia Search. I am already using Amazon's Simple Email Service (SES) thanks to Django-SES. It's extremely reliable and reasonably priced.
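
For the curious, wiring SES into Django takes little more than a settings change. A minimal sketch, assuming the django-ses backend name of the time and placeholder credentials:

    # settings.py (sketch): route all outgoing Django mail through SES
    EMAIL_BACKEND = "django_ses.SESBackend"
    AWS_ACCESS_KEY_ID = "your-access-key"        # placeholder credentials
    AWS_SECRET_ACCESS_KEY = "your-secret-key"

    # Elsewhere in the application, the normal mail API just works:
    # from django.core.mail import send_mail
    # send_mail("Subject", "Body", "from@example.com", ["to@example.com"])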

The speed with which Amazon is rolling out these services is truly amazing. It's like every system they build to run amazon.com (the ecommerce business) is a candidate for a new AWS product. My only concern is Amazon's long-term ability to support this growing portfolio of services. No doubt they will run into Google Wave-like burnouts that will shake the well-earned trust of the development community. What will they do if a service turns out to be a bad business? Will they support it at a reduced level or shut it down completely?

Apr 03, 2012

Planning for Localization

Localization can be an elusive requirement for a website. During the platform selection process, internationalization is often listed as a "strong" requirement. Why wouldn't you want the ability to reach new markets? Then, during the implementation process, localization gets downgraded to a "nice to have" or "future" requirement as time and resources dwindle and compromises need to be made to launch the primary language site. Eventually, localization becomes an "oh crap" requirement when someone expects to have a "Spanish version of the site up ASAP." After all, localization was part of the business case and a key selling point of the platform. Yes, localizing the site would have been as easy as hiring translators if only some accommodations had been made during the initial implementation. But they were not. Hence the "oh crap."

While you may not have budget to localize your website during your initial CMS implementation, following these steps will make adding a new language easier down the road.

  1. Learn how internationalization works on your platform.

    Different platforms have different approaches and best practices for internationalization. If you make the modest investment to fully understand these techniques, you can at least make educated decisions that defer localization without ruling it out. If you are working with a systems integrator (which I highly recommend), make sure they have experience building (and maintaining) localized websites on the platform.

  2. Use the translation framework provided by the templating system.

    Any given web page will have a lot of static text that lives in templates (as opposed to in the content). For example, there might be a label on the upper right that says "search" or a copyright statement at the bottom of the page. It is not practical to manage these little "content crumbs" through your usual editorial workflow — they are small, numerous, easy to lose track of, and rarely change. But if you are planning on localizing your site, you don't want those strings in your templates either. It is much better to manage these strings separately in "resource" or "message" files that can be shipped off to translators. Most templating languages come with a system for invoking externally managed strings. For example, with the JSP Standard Tag Library, translated strings are invoked like this:


    <fmt:message key="some.text"/>

    In eZ Publish they look like this:


    {"Some text you are going to translate"|i18n('design/standard/node')}

    Setting this up is easy enough to do when you are building out your templates for the first time. However, it is very tedious to retrofit old mono-lingual templates with this system. You wind up doing it for templates that are no longer used "just in case." Worst of all, you have to visually re-inspect every pixel of the site because you are touching nearly every line of view code.

  3. Keep text out of images.

    Having text in images adds a lot of friction to the localization process. First, you must remember to keep the image source files on hand so you can produce the translated versions. Second, the translation process goes through the additional steps of extracting and then re-importing the localized text. Managing all of those image files can be a real pain too. It is much better to float text over image backgrounds for buttons, navigation, and slides. Incidentally, applying this practice will also help with SEO and accessibility.

  4. Make room for other languages.

    Think of your primary language as the first of several languages when you are designing your content repository and defining roles and workflows. Localized content collections will need a place to live. They will need access control so that a local team can safely work on their content without putting the rest of the content repository at risk. Pay special attention to how "global" reusable content components are managed and retrieved.

  5. Buy your top level domains NOW.

    If you will be publishing your sites to different markets, start working now on acquiring domains under those top-level domains (like .fr or .es). It would be really embarrassing to enter a new market only to find someone else squatting on your domain.

  6. Set the appropriate character encoding.

    Most of the time this is a non-issue because most modern technologies default to UTF-8. Just make sure that you set up your databases with UTF-8 encoding and collation. Some older versions of programming languages require adjustments when dealing with Unicode too.
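
    For a concrete example, here is a hedged sketch of a Django database configuration for MySQL; the option name is specific to the MySQL driver, and the database itself must also be created with UTF-8 defaults:

    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.mysql",
            "NAME": "mysite",
            "USER": "mysite",
            "PASSWORD": "...",                # placeholder
            "OPTIONS": {"charset": "utf8"},   # make the connection speak UTF-8
        }
    }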

If you think localization may be in your future, plan for it now. Take the additional steps to reduce rework and risk when you are under the gun to get that new language published. If you didn't follow this advice, I would look into translation proxies.

Jan 17, 2012

Fun with static publishing

In the old days, static publishing (or baking, where the CMS generates static HTML files at publish time) was pretty much the standard. Most of the WCM products on the market did static publishing: Interwoven, Tridion, RedDot, Percussion.... Even the frying systems like Vignette and FatWire (FutureTense/OpenMarket back then) relied so heavily on caching that they were practically baking-style systems. Computing power was so expensive back then that you didn't dare do much processing at request time.

Then frying-style systems became the norm. WCM vendors needed an answer to market demand for personalization and other dynamic behavior, and the only way to deliver it was through dynamic (request-time) publishing. But static publishing isn't dead. There are many baking-style systems on the market and lots of customers swear by static publishing. Good strategies have emerged to overcome its limitations. The primary benefits of static publishing are still very real: cost savings, security, and stability. You can cheaply stand up web servers that have the easy job of just serving static HTML files.

And this brings me to my little obsession with static publishing. I am hosting a few sites on Amazon S3. The cost is ridiculously low and the speed is crazy-fast. Publishing them is fun too. For example, I publish my little personal site (www.sethgottlieb.com) using a site generator called Hyde, which is a Python port of a Ruby-based system called Jekyll. The way these generators work is that you enter your content in HTML, Markdown, or some other syntax and then run a script that renders static HTML pages with your presentation templates. Presentation templates can also do useful things like create listing pages. The Hyde sample site has a blog, and there is a script to migrate from Wordpress. Versioning? Git, of course. In fact, using a source code control system also allows multiple authors to collaborate. Site generators do other useful things like minifying your CSS and JS files for maximum performance. Want interactive features like search or commenting? For search you could use Google Custom Search. Commenting can be supported through a service like Disqus. You can configure secure areas at the web server level. For little bits of interactivity, you could embed some client-side JavaScript or server-side PHP in your static HTML files.
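
To give a flavor of how these generators work, here is a toy sketch in Python; it uses the third-party markdown package, and real generators like Hyde add layouts, listing pages, and asset pipelines on top of this idea:

    import markdown  # third-party package: pip install markdown

    TEMPLATE = "<html><head><title>%s</title></head><body>%s</body></html>"

    def bake(source_path, output_path, title):
        # Render one Markdown source file into a static HTML page
        with open(source_path) as source:
            body = markdown.markdown(source.read())
        with open(output_path, "w") as output:
            output.write(TEMPLATE % (title, body))

    bake("about.md", "about.html", "About")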

I also used static publishing to create an archive of a site that I no longer update. I created a Drupal website for my wife's birthday a few years ago. I didn't want to pay for LAMP hosting indefinitely, but I didn't want to lose the site either. My solution was to pull the whole site down using wget and then upload it to S3. The site has lots of pages and I wouldn't want to manage them as static HTML, but since I have no plans to change the site, I don't have to worry about it.
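
The upload half of that trick is easy to script. Here is a sketch using the boto library of that era; the bucket name and local directory are made up:

    import mimetypes
    import os

    import boto  # reads AWS credentials from the environment

    conn = boto.connect_s3()
    bucket = conn.get_bucket("my-archived-site")  # assumed bucket name

    for root, _dirs, files in os.walk("site"):
        for name in files:
            path = os.path.join(root, name)
            key = bucket.new_key(os.path.relpath(path, "site"))
            content_type = mimetypes.guess_type(path)[0] or "text/html"
            key.set_contents_from_filename(
                path,
                headers={"Content-Type": content_type},
                policy="public-read",  # the whole site is public anyway
            )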

I am not saying that static publishing is a good idea all or even most of the time. Dynamic publishing opens up a whole new world of interactivity and personalization. Just don't write off static publishing too quickly — especially if your site doesn't change much and doesn't need to be very interactive.

Jul 18, 2011

PHP + JCR = PHPCR

My former colleague Lukas Kahwe Smith recently gave me an update on what is happening with the PHPCR initiative. Readers of this blog might remember my brief mention of PHPCR and the Jackalope implementation. My initial response was that I was unsure whether the idea would take off, but now I am pretty impressed with what I heard from Lukas.

Lukas and his team from Liip AG have been contributing to Jackalope as part of a large custom, content-centric web application they are building for a client using Symfony2. Jackalope goes beyond standard relational database persistence by providing sophisticated content services like content hierarchy, tree traversal, versioning, and content staging — common weaknesses in homebrew CMSs.

I can see a number of benefits to a PHP developer using PHPCR:

  • There is less to build than when working directly against a REST interface like Apache Sling. You don't have to worry about making requests and marshaling XML or JSON into programmable PHP objects.

  • Your code can store any type of data in the JCR (not just documents). Using CMIS would be a bit of a stretch for anything but document data. Liip has developed an object-to-JCR node mapping layer (called phpcr-odm, part of the Doctrine project) that behaves like a PHP ORM service.

  • The persistence engine is abstracted so you can swap it out with something that meets your performance needs. Jackalope ships with Apache Jackrabbit, but there are also transports in the works for MongoDB and standard SQL databases. They should be mature by the end of the year.

  • You can use PHP to build delivery tiers and other web applications using content managed in a JCR-based CMS (such as Adobe CQ5, HippoCMS, or Magnolia).

If you are building a content-centric web application in PHP and you find yourself doing unnatural things to a relational database to meet your requirements, consider using Jackalope or the Midgard PHPCR implementation (which is designed more for speed). You are probably already using Lucene for search indexing; how much trouble can one more Java application be to manage on your infrastructure?

May 24, 2011

Architecture Aikido

Over the past year I have been doing some side work building a web application for a startup. The project has been very interesting and the process has helped me stay in touch with my inner developer. It has also allowed me to practice agile product management philosophy to an extreme.

Last week I read Andy Hunt's article "Uncomfortable with Agile" and I have been ruminating on that feeling of being in a constant state of discomfort. If you click through and read the article, you will learn that discomfort is a good thing. The alternative is blissfully trusting a process — and then finding that you were wasting your time. Agile revolves around challenge. You challenge your assumptions and your limitations. That discomfort you feel is really the awareness that everything is in play and you personally are accountable for a successful outcome.

Recently, I have been feeling like architecture in an agile environment is a little like wrestling or aikido. The goal is to maintain your balance and stay on your feet. Your opponent (the customer) pushes requirements in one direction and you respond by building up the architecture in that area, like you might move your feet to stop yourself from being knocked over. Then, all of a sudden, the customer changes direction and you need to quickly adjust. The trick is to never over-commit in one direction or the other. Over-committing puts your balance at risk because, if the direction of force changes, all your weight will be on the wrong foot.

We see over-committing in architecture all the time. You go in thinking one feature will be the center of the application but then you find that little thing you built on the side as an afterthought is the killer feature. This is especially true with startups that are constantly "pivoting" or redefining their product. In my experience, most over-commitment sins are done in the name of performance optimization. You structure the data or the code in such a way that you can cut some corners and shave some processing time. Sometimes these optimizations are absolutely necessary but they leave you exposed for the next weight-shift. When this happens, your next iteration is spent regaining the balance of the application (refactoring) rather than adding new features.

The next time you are in charge of the technical design of an agile project, think of it as an aikido match. Fluidly adjust to changing requirements, but try not to extend too far from a position of balance. Otherwise, plan to spend time getting back on your feet.

Mar 01, 2011

The Placeholder Application Controller Pattern

One of the main benefits of using a coupled (aka "frying") web content management system (WCMS) is that you get a web application development framework with which to build dynamic content-driven applications. Like nearly all modern web application development frameworks, a coupled CMS provides an implementation of the MVC (Model-View-Controller) pattern. For the less technical reader, the MVC pattern is used in just about all software that provides a user interface. The essence is that the model (the data), the view (how the information is presented and interacted with), and the controller (the business logic of the application) are separate concerns and keeping them separate makes the software more maintainable. I will let Wikipedia provide a further explanation of MVC.

The MVC implementation that comes with a typical WCMS is less flexible than a generic, all-purpose web application framework. Your WCMS delivery tier makes a lot of assumptions because it knows that you are primarily trying to publish semi-structured content to some form of document format — probably an HTML page, but possibly some XML-based syndication format. Your WCMS knows roughly what your model looks like (it is a semi-structured content item from the repository that has only the data types that the repository supports); it has a certain way of interpreting URLs; and the output is designed to be cached for rapid page loads. Those assumptions are handy because they make less work for the developer. In fact, most of the work on a typical WCMS implementation is done in the templates. There is hardly any work done in the model or controller logic.

But there are times when the assumptions of the CMS are broken. Anyone who has implemented a relatively sophisticated website on a CMS has had the experience of either overloading or working around the MVC pattern that comes with the CMS. The approach that I want to talk about here is what I call the PAC (Placeholder Application Controller) pattern. In a nutshell, this pattern de-emphasizes the roles of the model and controller and overloads the view (template) to support logic for a mini-application. The content item goes from being the model to a placeholder on the website. The controller is used more like a switchboard to dial up the right template.

Here is a common example. Let's say that you are building a web site for an insurance company. Most of the pages on the site are pretty much static. But there is one page with a calculator where a visitor enters some information about what he wants to insure and gets back a recommended coverage amount and estimated premiums. It would be pretty silly to try to manage all the data that drives the coverage calculator as content in the CMS. Instead, you would probably want to write the calculator in client-side JavaScript, copy it into a presentation template, and then assign that presentation template to a blank content page with the title "Coverage Calculator." The Coverage Calculator page in the content repository is really just a placeholder that gives your JavaScript application a URL on the site.
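
To make the pattern concrete, here is a toy sketch in Python. All of the names and structures are invented; in a real WCMS, the repository and controller come with the product:

    REPO = {
        "/coverage-calculator": {"title": "Coverage Calculator", "template": "calculator"},
        "/about": {"title": "About Us", "body": "Our story...", "template": "page"},
    }

    def page_template(item):
        # Normal MVC: the template just formats the model it was handed
        return "<h1>%s</h1><p>%s</p>" % (item["title"], item.get("body", ""))

    def calculator_template(item):
        # PAC: the "template" treats its model as a placeholder and runs its
        # own mini-application, here by emitting a client-side calculator
        return "<h1>%s</h1><script src='/js/calculator.js'></script>" % item["title"]

    TEMPLATES = {"page": page_template, "calculator": calculator_template}

    def controller(path):
        # The controller is reduced to a switchboard that dials up a template
        item = REPO[path]
        return TEMPLATES[item["template"]](item)

    print(controller("/coverage-calculator"))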

To a lesser extent, home pages often implement the PAC pattern. In this case, the home page might be a simple empty placeholder page that is assigned a powerful template that queries and features content from across the site. When the controller grabs the template and model, it may only think that it is rendering the content that is managed in the home page asset. Little does the controller know, the template is going to take over and start to act like a controller — grabbing other models and applying other templates to them.

Placeholder Application Controller is one of those patterns that, once you recognize it, you realize you use all the time. It is convenient and practical, but be careful with it because it is easy to get carried away. The main risk of the PAC pattern is that you are going against the grain of the WCMS architecture. Templates are supposed to be for formatting only. You may be pushing the templating language a little farther than it was intended to go, and your code may become unmanageable. You also may be short-circuiting the security controls provided by the controller. Some WCMS platforms have a pluggable architecture that allows 3rd party modules (programmed in something other than the template language) to step in and play the roles of model, view, and controller. This helps keep the architecture cleaner, but there will always be some limitations on how these modules are allowed to work. After a certain point, you will be better off going with a generic web application framework that affords you more flexibility and just using the WCMS to publish content into your custom web application. But that is a much larger undertaking.

Jan 10, 2011

Repository-Based vs. Presentation-Based Search

Search is probably the most common visitor-facing requirement in web content management system implementations. Usually the requirement is written in terse form such as "basic search" or "advanced search." But there are many nuances that need to be accounted for. There are essentially two approaches to implementing search requirements: repository-based search and page-based search.

A repository-based search indexes content items in the content repository. A page-based search indexes the pages of the site. This distinction is more important than you might think — especially if the site design heavily re-uses content. Here is an example. Let's say your site has pictures that are presented in slideshows. The picture content type has a caption that is searchable. A page-based search for a word in the caption of a picture will return the slideshow(s) where the picture is used. A repository-based search will return the picture item itself — but what if there is no detail page for the picture content type? You might have to do something like create a fake detail page that redirects the user to a slideshow page. Another difference is that a page-based search will index text that is hard coded into the presentation templates. For example, you might have your hours of operation in the footer of every page of the site and a "Visit" page that contains the hours plus directions. If a visitor types "hours" into a page-based search engine, he will get every page on the site in the results. A repository-based search engine will return the "Visit" page.
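
Here is a toy illustration of the difference; the content, markup, and functions are all invented:

    repository = [
        {"id": "picture-42", "caption": "Sunset over the harbor"},
        {"id": "visit", "body": "Hours of operation plus directions"},
    ]

    pages = {
        "/slideshows/summer": "Sunset over the harbor <footer>Hours: 9-5</footer>",
        "/visit": "Hours of operation plus directions <footer>Hours: 9-5</footer>",
    }

    def repo_search(term):
        # Matches content items and their fields, even items with no page of their own
        return [item["id"] for item in repository
                if any(term.lower() in str(value).lower() for value in item.values())]

    def page_search(term):
        # Matches whatever text ended up on the rendered page, template chrome included
        return [url for url, html in pages.items() if term.lower() in html.lower()]

    print(repo_search("sunset"))  # ['picture-42'], but is there a detail page for it?
    print(page_search("hours"))   # every page, because the footer repeats everywhere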

Generally speaking, the search functionality that comes out of the box in a CMS is repository-based. This is necessary because content contributors need a repository-based search to navigate the repository and find content to work on. Some of this content has not yet even been published on the site. Whether you need a page-based search engine for your visitors to use will depend on the nature of your site. Most types of websites do better with a page-based visitor search because a page is a good enough proxy for a piece of content and page-based search engines are generally easier to set up (look how easy it is to set up Google Custom Search). However, page-based search doesn't work well for all sites. In an eCommerce site that has a product catalog, you want to index the products themselves, not all the pages where the products are promoted. If you have requirements for a fielded search, like finding calendar events that occur within a date range, you will also need a repository-based search that indexes individual fields.

So, next time you are thinking about search, think about whether you want the search engine to index the pages on your site or the content that is being presented in those pages. As with all requirements, the best way to capture search requirements is through scenarios that present real-world examples.

Mar 25, 2010

The Onion's Migration from Drupal to Django

There is a great Reddit thread on The Onion's migration from Drupal to Django. The Onion was one of the companies that I interviewed for the Drupal for Publishers report. One of the things I mention in the report is that The Onion was running on an early version (4.7) of Drupal. The Onion was one of the first high traffic sites to adopt Drupal and the team had to hack the Drupal core to achieve the scalability that they needed. While versions 5 and 6 of Drupal made substantial performance improvements, The Onion's version was too far forked to cleanly upgrade.

Still, The Onion benefited greatly from using Drupal. They were able to minimize up-front costs by leveraging Drupal's native functionality and adapt the solution as their needs changed. Scalability was a challenge but it was a manageable one. Even though forking the code base was not ideal, it was a better alternative than running into a brick wall and having to migrate under duress. The Drupal community also benefited from the exposure and learning that came from The Onion using Drupal. Everybody won — how often can you say that?

I can understand the choice of Django 1.1 (current) over a hacked version of Drupal 4.7. Having built sites in both Drupal and Django, I can also see the appeal of using Django over Drupal 6.16 (current). Django is a more programming-oriented framework and The Onion has programmers. Django is designed to be as straightforward and "Pythonic" as possible. Drupal tries to make it possible to get things done without writing any code at all; and if you can avoid writing code in Drupal, you should. As a programming framework, Drupal has more indirection and asserts more control over the developer. The Onion's staff of programmers clearly appreciates the programmatic control that Django affords and they are quite happy with their decision.

Feb 23, 2010

NoSQL Deja Vu

Around thirteen years ago, I helped build a prototype for a custom CRM system that ran on an object database (ObjectStore). The idea isn't quite as crazy as it sounds. The data was extremely hierarchical with parent companies and subsidiaries and divisions and then people assigned to the individual divisions. It was the kind of data model where nearly every query had several recursive joins and there were concerns about performance. Also, the team was really curious about object databases so it was a pretty cool project.

One thing that I learned during that project is that (at least back then) the object database market was doomed. The problem was that when you said "database," people heard "tables of information." When you said "data" people wanted to bring the database administrator (DBA) into the discussion. An object database, which has no tables and was alien to most DBAs, broke those two key assumptions and created an atmosphere of fear, uncertainty and doubt. The DBA, who built a career on SQL, didn't want to be responsible for something unfamiliar. The ObjectStore sales guy told me that he was only successful when the internal object database champion positioned the product as a "permanent object cache" rather than a database. By hiding the word "data," projects were able to fly under the DBA radar.

Fast forward to the present and it feels like the same conflict is happening over NoSQL databases. All the same dynamics seem to be here. Programmers love the idea of breaking out of old-fashioned tables for their non-tabular data. Programmers also like the idea of data that is as distributed as their applications are. Many DBAs are fearful of the technology. Will this marginalize their skills? Will they be on the hook when the thing blows up?

I don't know if NoSQL databases will suffer the same fate as object databases did back in the '90s, but the landscape seems to have shifted since then. The biggest change is that DBAs are less powerful than they used to be. It used to be that if you were working on any application that was even remotely related to data, you had to have at least a slice of the DBA's time allocated to your project. Now, unless the application/business is very data-centric (like accounting, ERP, CRM, etc.), there may not even be a DBA in the picture. This trend is a result of two innovations. The first is object-relational mapping (ORM) technology, where schemas and queries are automatically generated based on the code that the programmer writes. With ORM, you work in an object model and the data model follows. This takes the data model out of the DBA's hands. The second innovation is cheap databases. When databases were expensive, they were centrally managed and tightly controlled. To get access to a database, you needed to involve the database group. Now, with free databases, the database becomes just another component in the application. The database group doesn't get involved.
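
For readers who have not watched an ORM at work, here is a small SQLAlchemy sketch. The schema, including the recursive parent/subsidiary join from the CRM story above, falls out of the class definitions:

    from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
    from sqlalchemy.orm import declarative_base, relationship

    Base = declarative_base()

    class Company(Base):
        __tablename__ = "company"
        id = Column(Integer, primary_key=True)
        name = Column(String)
        parent_id = Column(Integer, ForeignKey("company.id"))
        # The recursive parent/subsidiary join, declared instead of hand-written
        parent = relationship("Company", remote_side=[id])

    # The schema is generated from the classes; no DBA required
    engine = create_engine("sqlite:///:memory:")
    Base.metadata.create_all(engine)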

Now that the database is a decision made by the programmer, I think non-relational databases have a better chance of adoption. Writing non-SQL queries to modify data is less daunting for a programmer who is accustomed to working in different programming languages. Still, the programmer needs good tools to browse and modify data because he doesn't want to write code for everything. Successful NoSQL databases will have administration tools. The JCR has the JCR Explorer. CMIS has a cool Adobe Air-based explorer. Both of these cases are repository standards that sit above a (relational or non-relational) database but they were critical for adoption. CouchDB has an administration client called Futon but most of the other NoSQL databases just support an API. You also want to have the data accessible to reporting and business intelligence tools. I think that a proliferation of administration/inspection/reporting tools will be a good signal that NoSQL is taking off.

Another potential advantage is the trend toward distributed applications, which breaks the model of a centralized database service. Oracle put so much marketing muscle into building up its database as the centralized information repository to rule the enterprise. In this world of distributed services talking through open APIs, that monolithic image looks primitive. What is more important is minimal latency, fault tolerance, and the ability to scale to very large data sets. A large centralized (and generalized) resource is at a disadvantage along all three of these dimensions. When you start talking about lots of independent databases, the homogeneity of data persistence becomes less of a concern. It's not like you are going to be integrating these services with SQL. If you did, your integration would be very brittle because these agilely-developed services are in a constant state of evolution. You just need strong, stable APIs to push and pull data in the necessary formats.

The geeky programmer in me (that loved working on that CRM project) is rooting for NoSQL databases. The recovering DBA in me cringes at the thought of battling data corruption with inferior, unfamiliar tools. In a perfect world, there will be room for both technologies: relational databases for relational data that needs to be centrally managed as an enterprise asset; NoSQL databases for data that doesn't naturally fit into a relational database schema or has volumes that would strain traditional database technology.

Jan 25, 2010

CMS Architecture: Managing Presentation Templates

Another geeky post...

In my last post, I described the relative merits of managing configuration in a repository vs. in the file system but excluded presentation templates, even though how they are managed is just as interesting. Like configuration, presentation templates can be managed in the file system or in the content repository. As with configuration, if you manage presentation templates in the repository, you need some way to deploy them from one instance of your site to another without moving the content over as well.

There are plenty of additional reasons why you would want to manage presentation templates on the file system. In particular, presentation templates are code, and you want to be able to use proven coding tools and techniques to manage them. Good developers will be familiar with using a source code management system to synchronize their local work areas and branch/tag the source tree. Development tools (IDEs and text editors) are designed to work on files in a local file system. If you manage presentation templates in the repository, you have to solve all sorts of problems like branching and merging and building a browser-based IDE or integrating with local IDEs. The latter can be done through WebDAV, and I have also seen customers use an Ant builder in Eclipse to push a file every time it changes. Still, the additional complexity can create frustrating issues when the deployment mechanism breaks.

As much as it complicates the architecture, there is one very good case when you would want to manage presentation templates in the repository: when you have a centralized CMS instance that supports multiple, independently developed sub-sites. For example, let's say you are a university and each school or department has its own web developer who wants to design and implement his own site design. This developer is competent and trustworthy, but you don't want to give him access to deploy his own code directly to the filesystem of the production server. He could accidentally break another site or, worse, bring down the whole server. You could centralize the testing and deployment of code, but that would just create a bottleneck. You could do something like put the CSS and JS in the repository and have him go all CSS Zen Garden, but sooner or later he will want to edit the HTML in the presentation templates.

In this scenario of distributed, delegated development, presentation templates are like content in two very important respects:

  1. presentation templates need access control rules to determine who can edit what.

  2. presentation templates become user input (and user input should never be trusted).

The second point is really important. Just like you need to think twice before allowing a content contributor to embed potentially malicious JavaScript into pages, you need to worry that a delegated template developer can deploy potentially dangerous server-side code. Once that code is on the filesystem of an environment, it can create all sorts of mischief. It doesn't matter whether it was intentional or not: if a programmer codes an infinite loop or compromises security, you have a problem. Using templating languages (like Smarty or Velocity) rather than a full programming language (like PHP or Java in JSP) will mitigate that risk, but you still have to worry about the developer uploading a script that can run on your server. With staging and workflow, CMSs are good at managing semi-trusted content like presentation templates from distributed independent developers. There is a clear boundary between the runtime of the site and the underlying environment.

If your CMS uses file-system based presentation templates and you delegate sub-site development to the departments that own them, you should definitely put in place some sort of automated deployment mechanism that keeps FTP and SSH access out of the developers' hands and reduces the potential for manual error. These guidelines are worth following:

  • Code should always be deployed from a source control system (via a branch or a tag). That way you will know what was deployed and you can redeploy the same tested code to different environments.

  • Deployments should be scripted. The scripts can manage the logic of what should be put where.

  • Every development team should have an integration environment where they can test code.
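
A deployment script along these lines does not need to be fancy. Here is a bare-bones sketch; the repository URL, tag, and paths are all placeholders:

    import subprocess

    REPO = "git@example.com:templates.git"    # placeholder repository URL
    TAG = "release-1.2"                       # deploy a tag, never the tip of trunk
    STAGING = "/tmp/deploy-" + TAG
    TARGET = "www@web01:/var/www/templates/"  # placeholder host and path

    # Export exactly the tagged, tested code from source control...
    subprocess.check_call(["git", "clone", "--depth", "1",
                           "--branch", TAG, REPO, STAGING])

    # ...then push it to the server so developers never need FTP or SSH access
    subprocess.check_call(["rsync", "-az", "--delete", "--exclude", ".git",
                           STAGING + "/", TARGET])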

One of my clients uses a product called AnthillPro for deployments of all web applications as well as presentation templates. It has taken a while to standardize and migrate all of the development teams, but now I don't see how you can run a decentralized development organization without something like it.

The other dimension to this problem is the coupling between the content model and the presentation templates. When you add an attribute to a content type, you need to update the presentation template to show it (or use it in some other way). The deployment of new presentation templates needs to be timed with content updates. Often content contributors will want to see the new attribute in preview when they are updating their content. Templates also need to fail gracefully when they request an attribute that does not exist or has not been populated yet. Typically, presentation templates evolve more rapidly than content models. After all, a change in a content model usually involves some manual content entry. In my university scenario, there is a benefit to centralizing ownership of the content model because it allows content sharing across sites: if one department defines a news item differently than another department, it is difficult to have a combined news feed. Centralizing the content model will further slow its evolution because the different departments need to stay aligned.

Wow, two geeky posts in a row. I promise the next one will be less technical.
