
Where content meets technology

Apr 15, 2013

Control vs. Delegation

A couple of months ago, I fell in love with the Platform as a Service (PaaS) Heroku for a new web application that we are building. Compared to managing a farm of EC2 instances and other Amazon Web Services (AWS) products, Heroku seemed like a dream. Setup is ridiculously easy and a simple Git push command deploys your application. You don't need to think through things like failover and escalation procedures. It is just supposed to work. Then I read this article about Heroku on the Rap Genius blog and all of a sudden the dream started to sound too good to be true. This article on The Virtualization Practice nicely summarizes the differing perspectives of Rap Genius, Heroku, and New Relic. To summarize even further, Heroku and New Relic were not transparent about the performance of the entire system, and this prevented Rap Genius from cost-effectively scaling their application.

In my case, I stuck with Heroku for a while and might have continued with it, but I ran into an issue where I could not easily install a Python library. That hassle, combined with the seeds of doubt already sown, sent me back to hosting directly on AWS. I still think that Heroku offers tremendous practicality and value to many different types of applications. But as with any choice, there are trade-offs. The big one with a PaaS like Heroku seems to be Control vs. Delegation. You can't really have both. If you want complete control over something, there is no way to avoid responsibility for it. With Heroku, if there is an incident or the site is slow, I would just have to say "Heroku is having a bad day," and get on with my own good day. I wouldn't be frantically trying to resolve the problem because my lack of control would render me powerless. Of course, that wouldn't protect me from blame. Blame would be applied retroactively for making the decision to host on Heroku in the first place (back when I had control).

With AWS, I have a little more control. I can design an architecture that spans different availability zones and regions. I can even create a mirror site on another hosting provider. But that means more effort and money. If I wanted even more control, I could host the sites on our own servers in some colocation facility. But a step toward full control doesn't necessarily reduce risk. It just shifts it. When I have more control, the sources of risk are my own failures in design and planning. And in my case, I just don't trust my knowledge of network and server administration to be better than the specialists'. As with any decision, you need to find a balance: you need enough control to fulfill your responsibilities but not so much that it exceeds your capability to handle it.

Jul 05, 2012

Can Outages be Good for the Cloud?

I was just thinking about the recent AWS outage and came to the conclusion that infrequent events like this probably help Amazon. While the first wave of response is usually criticism and doubt, the end result is probably increased adoption. Here is why. I don't think that events like these are chasing anyone away from the cloud. The reality is that technology occasionally breaks, especially under extreme conditions. No matter where you have a data center, bad events are going to happen. I was just talking to a friend who works in the Baltimore area, and their self-hosted site has been down for days due to power outages. When people take a close look, they realize that most organizations cannot provide better availability on their own than cloud computing resources do.

I expect that, rather than abandon the cloud, most customers will do what I found myself doing, which is the opposite: increase their cloud investment. If they were in one availability zone, they will expand their cluster to get into multiple availability zones and even multiple regions (which I am doing). They will create warm standby servers to switch over to in the event of a catastrophic failure. All these changes increase their monthly AWS bill and lead to higher Amazon revenues. It's just like the insurance business. The best time to sell new policies is right after an unlikely disaster, which is also usually the time when the disaster is least likely to happen again (especially when new controls are put in place to prevent it).
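To put rough numbers on why spreading across availability zones is tempting, here is a minimal back-of-the-envelope sketch. It assumes that zones fail independently of each other, which real outages do not always respect, and the 99.5% per-zone figure is a made-up illustration, not a number published by Amazon. The function names are mine as well.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(per_zone: float, zones: int) -> float:
    """Chance that at least one zone is up, assuming independent failures."""
    if not 0.0 <= per_zone <= 1.0:
        raise ValueError("per_zone must be between 0 and 1")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    # The site is down only when every zone is down at the same time.
    return 1.0 - (1.0 - per_zone) ** zones


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    per_zone = 0.995  # hypothetical per-zone availability, for illustration only
    for zones in (1, 2, 3):
        availability = combined_availability(per_zone, zones)
        hours = expected_downtime_hours(availability)
        print(f"{zones} zone(s): {availability:.6f} available, "
              f"about {hours:.2f} hours down per year")
```

Each extra zone shrinks the expected downtime by a large factor under these assumptions, and each extra zone also adds to the monthly bill, which is the point of the insurance comparison.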

Jul 02, 2012

Cloud Blues

I have a site that was affected by the recent AWS outage that took out Netflix, Instagram, and Pinterest on Friday. I thought I was in the clear when I checked on Friday. But when I checked again on Sunday evening, my site was down. The problem was that the database didn't come back correctly. Last night I went through the restore procedure, and the restore tool has been stuck in process all night. After buying a support contract, I learned that it hangs when the backup is corrupt. All of the support documentation says to just wait it out. I am still working with support to get a good backup.

This is very frustrating. The only comfort is that sites much bigger and more important than mine have also been taken down. The news about those big sites makes it easier to explain the issue to my users. It definitely beats having to say that I did something stupid.