Monitoring your hot standby

As I mentioned in “Marketing I.T. in the Cloud,” we maintain a hot-standby infrastructure on another hosting service. We automatically refresh this standby setup at a regular interval so that, if something really bad happened to our primary infrastructure, we could just re-point our domains and relatively recent versions of our sites would be back up.

Because this backup infrastructure is so critical, we monitor it just like our production infrastructure. One thing we have not been able to automate, however, is the check that the front page is loading properly. The problem is that WordPress forces a redirect to a primary host name. That is, if you try to hit the IP address of the backup server, it redirects to the primary domain of the site (like http://www.globalmarketingops.com), which is still running on the primary servers. The fact that the redirect happens at all tells you that Apache is running and WordPress is able to read from the database. But it doesn’t tell you whether the pages are broken. This is a little like driving around with a flat spare tire. You know you have it, but you don’t know it is useless until you try to use it.
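To make the redirect problem concrete, here is a minimal Python sketch of what an IP-based check actually sees. It is an illustration rather than part of our monitoring setup: the function names are hypothetical, and it assumes the backup server answers plain HTTP on port 80.

```python
import http.client
from urllib.parse import urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)

def redirect_target_host(status, location):
    """Return the host a response redirects to, or None if it is not a redirect."""
    if status in REDIRECT_CODES and location:
        return urlsplit(location).hostname
    return None

def probe(ip_address, timeout=10):
    """Request "/" from the server by IP address without following redirects."""
    conn = http.client.HTTPConnection(ip_address, 80, timeout=timeout)
    try:
        conn.request("GET", "/")
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

Calling `probe()` with the backup server's IP address and passing the result to `redirect_target_host()` would return the primary domain when WordPress redirects. That confirms the stack is alive, but the page content on the backup server is never fetched.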

Our work-around has been to manually edit our hosts file (which overrides the DNS) and browse the site on a regular basis. But that is a drag. I couldn’t find a ping service that allowed me to override the public DNS with temporary mappings. So I came up with something much simpler.

I edited the /etc/hosts file of the backup server to map the supported domains to itself (localhost). With that setting in place, we can run a script on the backup server that loads the home page of each site and verifies that it contains certain text. The script sends out a pass or fail notification. If the server were too broken to run the script, we would not get the pass notifications, and our other alerts would sound. If we get a fail notification, we know that we need to fix the backup site.
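A check along these lines could look like the following Python sketch. It is not our actual script: the marker text, the function names, and the use of printed output in place of a real notification are all assumptions for illustration. It also assumes /etc/hosts on the backup server already maps each listed domain to 127.0.0.1.

```python
import urllib.request

# Hypothetical site list: each home page URL is paired with text that must
# appear on that page. Replace both with your own domains and marker text.
SITES = {
    "http://www.globalmarketingops.com/": "Global Marketing Ops",
}

def page_contains(html, expected_text):
    """Return True if the expected marker text appears in the page body."""
    return expected_text in html

def fetch(url, timeout=10):
    """Fetch a page and return its body as text (urllib follows redirects)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

def check_sites(sites, fetcher=fetch):
    """Return a dict mapping each URL to "pass" or "fail"."""
    results = {}
    for url, expected_text in sites.items():
        try:
            ok = page_contains(fetcher(url), expected_text)
        except (OSError, ValueError):
            # Connection errors, HTTP errors, and timeouts all count as a fail.
            ok = False
        results[url] = "pass" if ok else "fail"
    return results

def main():
    """Print one line per site; swap print for your notification of choice."""
    results = check_sites(SITES)
    for url, status in results.items():
        print(f"{status.upper()}: {url}")
    # A non-zero return value lets a scheduler treat any failure as an error.
    return 0 if all(s == "pass" for s in results.values()) else 1
```

Run `main()` on a schedule (from cron, for example) on the backup server itself. Because the hosts file points the domains at localhost, the WordPress redirect lands back on the backup server and the script sees the backup's own pages.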

This is a pretty common challenge, so I hope others can benefit from this simple solution.

  • http://www.thinkcreative.com Mark Marsiglio

    Depending on how your failover system gets synchronized with the primary system, you should also consider pulling the same page from the production server and running a diff to compare them. If the synchronization were to fail, your front page could still load with content from 3 months ago. It would pass your test but would still be wrong.

    This is especially important when using a hot standby server for a news site with frequent updates.

    We use the same strategy but with different tools, which I outlined in my post here: http://www.thinkcreative.com/TC-Blog/Why-we-upgraded-from-Amazon-Web-Services

    • https://plus.google.com/u/0/102992494947018640072 Seth Gottlieb

      Great point, Mark! Another colleague of mine pointed out that I should have been clearer that this technique is in addition to other monitoring functionality. We still ping the server directly (by IP address) to make sure that the server is listening properly, and we have all of our ServerDensity monitors hooked up as well.