Content Here | articles tagged "DevOps"

Sep 02, 2014

Radio Silence

You might have noticed that I haven't been posting here much over the last year. More likely, you haven't noticed at all. Blogs are often like that. They come and go depending on the whims of the blogger.

The reason for my silence is that my focus has shifted from pure content management to lean product development, web application development, and cloud-based architecture. At Lionbridge, I have been helping to build a business called Lionbridge onDemand. It has been a little over a year since our first sale and I am in a constant state of amazement about how things are growing. Like most of my projects, we followed a lean product development model of launching a minimum viable product and continually improving it to support the needs of our growing customer base. In this case, we started with a tiny website to translate videos. Since then we have grown to the point where:

We now have a wide array of services that support nearly 40 different content types. We even have services that do not involve translation at all, such as proofreading and CRM data cleanup.
There are enterprise sites for many name-brand clients. It turns out that large organizations are full of individuals who prefer a simple consumer-style eCommerce experience. With onDemand Enterprise, we can quickly create a custom site for a corporate account. These enterprise sites have the same simple "consumer" feel but they also have the ability to offer custom services and can tap into corporate payment channels.
We have a Public API with a developer program. We started with an API for eCommerce systems to translate product catalogs, but then we grew into a full-featured translation API that can handle our full complement of content types. While there are other translation APIs out there, ours is unique because it exposes different translation services ranging from pure machine translation to professional translation by subject matter specialists.
We are now a global team working around the clock to deliver projects with industry-leading quality. We have operations specialists in North America, Europe, and Asia. Managing a service like this requires a special mix of problem-solving skills. This is content management at a level of complexity that I have not seen before. The process requires the constant attention of highly skilled and conscientious individuals.

The last point is my favorite. I am currently en route from Warsaw, Poland, where I have spent the last five weeks working with the operations core team. It was an incredible experience to meet the people that I had only known through email and conference calls. From the start, I was impressed by their dedication and can-do spirit. Now they feel like family.

So, back to this blog. What has made Content Here a worthy endeavor so far was that it provided a useful place for me to explore observations and concepts that I encountered as a content management professional. If I learned something, I would flesh out my knowledge by writing it here. If I did a decent job of explaining something to someone, I would reproduce that explanation here. If that was all blogging did for me, it would have been enough. But blogging provided so many more benefits. In particular, it was a way to connect with other people. I made many business connections and friendships with this blog.

I am hoping that I can continue experiencing these benefits by shifting the focus of this blog to be more aligned with my day to day work. It won't be a total departure from my older posts. My involvement in web development and open source software has stayed the same. But I will probably write a lot less about content management software, product selection, taxonomy, workflow, and other pure content management concepts. To be honest, I feel like there is not much more for me to write on those topics. There will be more posts about my experiences from growing a web business. Topics will include things like working with distributed teams, lean product development, customer support, web application development, and web operations.

Hopefully, along the way, I will meet new people who are struggling with these same topics. The web is a pretty big place so I think that chances are good.

posted at 09:17 · off-topic DevOps Web Operations Product Management open source

Dec 17, 2013

Evolutionary Software Development: A Model for Maintaining Web Sites and Applications

Since I got into software development back in 1995, I have been exposed to many different software development methodologies. I started out with a highly structured waterfall model (which is hard to avoid with fixed bid projects). Whenever possible, I have advocated agile approaches and I have been on teams that have practiced agile methods like Scrum with varying degrees of fidelity. Overall, I love the values of agile but the agile methodologies that I know of don't quite feel right for the kind of work I do now.

These days my job focuses on managing several different web sites and applications. Officially, you could say that these applications are in "maintenance mode" but the truth is that they were always in maintenance mode. All of these applications have been built on the philosophy of a minimum viable product. That is, we developed a minimal solution to test a) that the problem actually exists and b) that our idea is an effective way of solving it. The initial solutions were just complete enough to answer these questions and create a foundation for an evolutionary process towards continuous improvement. Everything since has been "maintenance."

The agile development methodologies that I know of don't quite fit because they all seem to focus on a notion of defining a 2-3 week "release" or "sprint" and putting the entire team onto a cadence that begins with planning and ends with some kind of review. But my teams do not work that way. We release one feature at a time. We don't wait for some large milestone to push new code into production. In fact, we push a new feature as soon as it is ready. We see an unpublished feature as a huge liability. Worse than simply not being able to get the benefit of the feature or see how the audience responds to it, unpublished code is a risk. When code sits in an unpublished state it needs to be maintained like working code. Otherwise, when you deploy it, you are likely to break other things. Also, large deployments are riskier than small deployments and make problems harder to diagnose.

We try to define features to be as small as possible. Some features are "below the waterline." That is, they are not visible to the end user and are only deployed to support another feature in the future. A very common example of this pattern is that we usually do our database work up front. For example, if we need to build a new page design, we will start by deploying the new fields that the design requires. Then our content team can add the data. Later, once the data is in place, we can deploy our page templates and other application logic that relies on the new data. We also commonly deploy infrastructure upgrades ahead of the features that they support. Finally, we try phasing in features with an initial simplistic implementation to test out the concept, followed by a series of enhancements to reach the full potential.
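To make the "database work up front" pattern concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column name, sample text, and helper functions are all hypothetical, chosen only for illustration; the point is that each of the three steps can be deployed on its own without breaking what is already live.

```python
import sqlite3

def deploy_fields(conn):
    """Deployment 1 (below the waterline): add a nullable column.

    Existing queries that do not mention the column keep working.
    """
    conn.execute("ALTER TABLE pages ADD COLUMN summary TEXT")

def render_page(row):
    """Deployment 2 (visible): the new template uses the field when
    content exists and falls back to the old output when it does not."""
    title, summary = row
    if summary:
        return f"{title}: {summary}"
    return title

# Hypothetical starting state: a live table with one page in it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO pages (title) VALUES ('Video translation')")

deploy_fields(conn)

# The old application code is untouched by the schema change.
old_rows = conn.execute("SELECT title FROM pages").fetchall()

# The content team fills in the new field at its own pace.
conn.execute("UPDATE pages SET summary = 'Sample summary' WHERE id = 1")

row = conn.execute("SELECT title, summary FROM pages WHERE id = 1").fetchone()
print(render_page(row))
```

Because the column is nullable and the template has a fallback, the order of the last two steps is not critical: a page with no summary yet simply renders the way it did before.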

When you are maintaining a live application, the line between a bug and an enhancement is blurred and the same team should be responsible for both. Anyone on the team needs to be able to pull herself away from the feature she is working on to fix an urgent issue. This is another reason why keeping things small is very important. It is also why it is so important to have a very clearly defined prioritization protocol. Ours is:

Site down
Feature broken with no work around
Feature broken with work around
Revenue opportunity
Customer Usability
Operations Usability
Brand consistency

The product manager is in charge of setting the priorities and assigning high priority issues that need to get taken care of ASAP.
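The protocol above is an ordered list, so it maps naturally onto a sort key. The following is only a sketch: the `Issue` class, the `reported_order` tie-breaker, and the sample issues are assumptions made for illustration, not a description of any real tracker.

```python
from dataclasses import dataclass
from enum import IntEnum

class Priority(IntEnum):
    """Lower number means more urgent, mirroring the list above."""
    SITE_DOWN = 1
    BROKEN_NO_WORKAROUND = 2
    BROKEN_WITH_WORKAROUND = 3
    REVENUE_OPPORTUNITY = 4
    CUSTOMER_USABILITY = 5
    OPERATIONS_USABILITY = 6
    BRAND_CONSISTENCY = 7

@dataclass
class Issue:
    title: str
    priority: Priority
    reported_order: int  # hypothetical tie-breaker: earlier reports first

def work_queue(issues):
    """Order issues by priority, then by the order they were reported."""
    return sorted(issues, key=lambda issue: (issue.priority, issue.reported_order))

# Hypothetical sample issues.
issues = [
    Issue("Logo uses old colors", Priority.BRAND_CONSISTENCY, 1),
    Issue("Checkout returns errors", Priority.BROKEN_NO_WORKAROUND, 2),
    Issue("Validation message hard to see", Priority.CUSTOMER_USABILITY, 3),
    Issue("Site unreachable", Priority.SITE_DOWN, 4),
]

for issue in work_queue(issues):
    print(issue.priority.name, "-", issue.title)
```

A fixed ordering like this does not replace the product manager's judgment. It only makes the default explicit so that anyone on the team can tell what to pick up next.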

The lines between coding, content, infrastructure, and support are also blurred. Sometimes an issue is noticed and the cause is not immediately apparent. For example, let's say a customer cannot download a file. It could be that the file was not added in the first place. The code to retrieve the file could be broken. The server could be out of space, so the file could not be saved. It could be a usability issue where the customer was clicking on the wrong link. Triaging and coordinating this work is critical and needs to be core to the methodology.

Refactoring is another big part of the methodology. You often run into situations where a feature gets a lot of traction but not in the way that you expected. You might need to change how it was built to allow it to evolve in the right direction. You will also need to refactor your successful experiments to support active use. The concept of a "user story" (common in agile methodologies) does not lend itself well to code refactoring.

As you can see, traditional agile methodologies do not accommodate the day-to-day activities required to maintain and extend an actively used web application. This is because they are designed at their core to support software development projects where the team has the luxury of ignoring the operational aspects of running the application as a service. But that luxury comes at a price: alienation from critical feedback by actual consumers of the application. In agile, a "customer" role filters and simplifies end user needs and tastes. This customer makes a lot of guesses about what the end customers want. The guesses tend to be big and there is no process for correcting the wrong ones. The agile approaches that I know of seem better for building packaged software or for integration companies that are brought in to build something and leave. These are cases where you cannot be told, or don't want to be told, that you need to improve something right after you thought you built it to specification.

Here is a very real world example to illustrate that point. We use LivePerson to chat with customers who are on our sites. Sometimes there are cases when a customer gets stuck because he does not notice or cannot understand an error message — such as one telling him that he needs to enter something into a certain field. If it is common enough, we will use that feedback to design, implement, and deploy a fix that makes the validation message more visible and understandable. Fixing this issue could be hugely important to conversions. But think about how this problem might be solved in Scrum. A "user story" would be added to the backlog. At some point, this story makes it into the next sprint. It needs to wait for the rest of the sprint to finish before it is deployed. If you do two-week sprints, that could be up to a four-week turn around. You might say that you would handle this in a maintenance track that runs in parallel to your real development. But I would counter that tweaks like this are where applications succeed and fail and should be the focus of (not the exception to) your process. In fact, I think you should think of everything as a tweak.
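The four-week figure can be checked with a little arithmetic. This sketch assumes a simplified Scrum model in which a new story can only enter at the next sprint planning and nothing ships until that sprint ends; the function name and parameters are made up for illustration.

```python
def scrum_turnaround_days(sprint_days, days_into_sprint):
    """Days from reporting a problem until the sprint containing the fix ships.

    Assumes the story waits for the current sprint to finish, then ships
    at the end of the following sprint.
    """
    if not 0 <= days_into_sprint < sprint_days:
        raise ValueError("days_into_sprint must fall inside the sprint")
    wait_for_next_sprint = sprint_days - days_into_sprint
    return wait_for_next_sprint + sprint_days

# Reported on the first day of a two-week sprint: the worst case.
print(scrum_turnaround_days(14, 0))   # 28 days, i.e. four weeks

# Reported on the last day of the sprint: the best case.
print(scrum_turnaround_days(14, 13))  # 15 days
```

Even the best case under these assumptions is longer than one full sprint, while deploying the fix as soon as it is ready takes only as long as the fix itself.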

Despite the fact that lots of organizations are maintaining sophisticated web applications and want to experiment with new features as fast as possible, I have found little writing on formal methodologies to support this. The best writing comes from the DevOps community but they don't often directly address the actual software development as much as I would hope. I like Eric Ries's chapter on continuous deployment in Web Operations: Keeping the Data On Time. Speaking of Eric Ries, lean startup literature is also a very good resource. But this community emphasizes customer development and doesn't explore the architecture/implementation side as much as I would like.

What really inspired me to write this post was this relatively obscure research paper called "An Online Evolutionary Approach to Developing Internet Services." This fit the bill when I was looking for a term to describe how we work. The word "evolution" is perfect. It captures the idea that development is a series of small modifications that respond to stresses and opportunities. Each modification is tested in the real world. The ineffective ones die off. The good ones are the foundation for future progress. It is true natural selection. We just try to be a little less random than biological evolution.

I would love to hear your feedback on this. I go back and forth between thinking that we are outliers for working in this way to thinking that this process is so universal that nobody thinks about it. I also think there is a lot to discuss with regards to testing, communicating changes to users, and localizing the solution.

posted at 12:35 · DevOps agile

Aug 21, 2013

Small Deployments

Last week, when we released a new version of Lionbridge onDemand, I was reminded of a valuable lesson: keep your deployments small. Up to that point, we had been upgrading the production environment on a regular basis — nearly daily. When a new feature was ready, we deployed it. But this development iteration had some systemic improvements (new service types, enterprise features…) so we held off to release everything at once. That was a mistake. The problem was that lots of configuration changes had been building up (new settings, new libraries on the server, etc.) and that led to a lot more work at launch time. As you would expect, some steps were missed that compromised functionality. It wasn't a huge deal. Nothing was obviously broken. There were just some "Oh SNAP!" moments along the way.

Thanks to that reminder, we are back on schedule with regular code deployments. Some of the updates will be "under the waterline": invisible changes to the underlying architecture to support visitor-facing features that we are still working on. The trick is to keep the backlog of undeployed code to a minimum. In addition to reducing the risk of the actual deployment, smaller upgrades make it easier to troubleshoot issues because there are fewer potential causes. For further reading on this topic, see Eric Ries's chapter on Continuous Deployment in Web Operations: Keeping the Data On Time. David Hobbs also has some great articles on this topic in relation to website launches. The one that springs to mind is Site Launches: do as little as possible. To work in this way, it is also critical that you use a mature code management strategy. A couple of months ago my team adopted the Git Flow branching strategy and it has been a huge help.

People getting used to this model of product management will find it counter-intuitive. They see upgrades as risky, scary things and the natural tendency is to avoid them. But the more you delay an upgrade, the scarier and riskier it gets. And that creates a vicious circle. Even if you know and agree with the small deployment approach, it is easy to justify delaying an upgrade because you want to do one more thing. That is what happened to me.

posted at 12:38 · DevOps development Product Management

Jul 23, 2013

Deployment Timing for a Global Business

Global businesses operate on a 24 hour workday. No matter what time it is, somebody is trying to get something done, whether it is a customer trying to interact with you or an employee trying to do her job. This can make system maintenance (such as deploying new website functionality or upgrading some business application) a challenge. Of course, we do our best to minimize or eliminate the amount of downtime a deployment causes. In fact, my team regularly deploys new code across our applications several times a day (yay, automation!). However, there are also deployments that you try to schedule outside of peak load. Perhaps there will be brief periods when the system is unavailable or in an awkward transitional state. So the question is, when do you do these deployments? We look at three factors:

The first thing we look at is the business calendar to make sure we are aware of any events that might drive traffic. For us, this could be a marketing campaign or, for our business applications, it could be a peak in workload (like a deadline). It's a good idea to have a mailing list of primary users and announce upgrade plans to them. They will tell you if there is an issue with your schedule.
The second thing we look at is daily load patterns. All of the systems that I manage are running on Amazon Web Services using Elastic Load Balancers so my favorite tool for this is a CloudWatch ELB Request Count report. That shows nice regular daily spikes of traffic to your sites.
The third, and possibly most important, thing we look at is our own work patterns. While the load patterns may tell you to do your deployments at 3AM, don't do it. You could be tired and careless at that time of night. Plus, you will probably want to go to bed right after and that would be the worst time to be sleeping. Configuration errors usually take a little while to surface. You don't want to wake up the next morning to find an inbox full of frustrated users who lost large chunks of productivity.
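The second and third factors can be combined into a simple rule: pick the quietest hour among the hours when the team is at work, rather than the quietest hour overall. The sketch below uses made-up hourly request counts in place of a real CloudWatch report, and the staffed hours (9 AM to 5 PM) are likewise an assumption.

```python
def quietest_staffed_hour(hourly_requests, staffed_hours):
    """Return the staffed hour (0-23) with the fewest requests."""
    candidates = {hour: hourly_requests[hour] for hour in staffed_hours}
    return min(candidates, key=candidates.get)

# Hypothetical request counts for hours 0 through 23 in the team's local time.
hourly_requests = [
    120, 90, 70, 40, 60, 110, 200, 320, 450, 520, 560, 540,
    480, 430, 380, 440, 500, 470, 400, 350, 300, 250, 200, 150,
]

# Hypothetical working hours: 9:00 through 17:59.
staffed_hours = range(9, 18)

overall_quietest = min(range(24), key=lambda hour: hourly_requests[hour])
chosen = quietest_staffed_hour(hourly_requests, staffed_hours)

print(f"Quietest hour overall: {overall_quietest}:00")
print(f"Quietest hour with the team on hand: {chosen}:00")
```

With this sample data the load pattern alone points at the middle of the night, while the staffed-hours rule lands on an afternoon dip. The business calendar check still has to happen separately.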

Using this logic, we typically find our best upgrade windows to be at unexpected times like 2PM Eastern on a Wednesday. During this time: most of Asia is sleeping; Europe is enjoying dinner; and most of the Americas is having lunch or in a post-meal coma. At this time, we have our full team on hand for testing and remediation if anything goes wrong. And we get to sleep well that night knowing if anything were to go wrong, it would have done so hours ago.
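To see what 2PM Eastern on a Wednesday looks like around the world, here is a small sketch using only the standard library. The cities and the fixed UTC offsets are assumptions (northern-summer offsets with daylight saving in effect where it applies); a real script would use a time zone database such as Python's zoneinfo module so that the offsets track the calendar.

```python
from datetime import datetime, timedelta, timezone

# Assumed northern-summer UTC offsets in hours; these are not valid year-round.
OFFSETS = {
    "New York": -4,
    "Los Angeles": -7,
    "Warsaw": 2,
    "Tokyo": 9,
}

def local_times(year, month, day, hour, origin="New York"):
    """Map each city to its local day and time at the given moment in `origin`."""
    origin_tz = timezone(timedelta(hours=OFFSETS[origin]))
    moment = datetime(year, month, day, hour, tzinfo=origin_tz)
    return {
        city: moment.astimezone(timezone(timedelta(hours=offset))).strftime("%a %H:%M")
        for city, offset in OFFSETS.items()
    }

# Wednesday, July 24, 2013 at 2PM Eastern.
times = local_times(2013, 7, 24, 14)
for city, local in times.items():
    print(f"{city}: {local}")
```

Under these assumed offsets the window falls in the late morning on the US west coast, in the evening in Warsaw, and in the small hours of Thursday in Tokyo.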

One caveat: this strategy probably won't work for you if you are delivering a high traffic service with a ridiculously stringent SLA. But if that is your business, you should probably have an around-the-clock, 3-shift dev-ops staff and lots of people planning and managing deployments.

posted at 10:00 · DevOps

Apr 15, 2013

Control vs. Delegation

A couple of months ago, I fell in love with the Platform as a Service (PaaS) Heroku for a new web application that we are building. Compared to managing a farm of EC2 instances and other Amazon Web Services (AWS) products, Heroku seemed like a dream. Setup is ridiculously easy and a simple Git push command deploys your application. You don't need to think through things like failover and escalation procedures. It is just supposed to work. Then I read this article about Heroku on the Rap Genius blog and all of a sudden the dream started to sound too good to be true. This article on The Virtualization Practice nicely summarizes the different perspectives of Rap Genius, Heroku, and New Relic. To summarize even further, Heroku and New Relic were not transparent about the performance of the entire system and this prevented Rap Genius from being able to cost-effectively scale their application.

In my case, I stuck with Heroku for a while and may have continued with it but I ran into an issue where I could not easily install a Python library and that hassle, combined with the seeds of doubt already sowed, sent me back to hosting directly on AWS. I still think that Heroku offers tremendous practicality and value to many different types of applications. But like with any choice, there are trade-offs. The big one with a PaaS like Heroku seems to be Control vs. Delegation. You can't really have both. If you want complete control over something, there is no way to avoid responsibility for it. With Heroku, if there is an incident or the site is slow, I would just have to say "Heroku is having a bad day," and get on with my own good day. I wouldn't be frantically trying to resolve the problem because my lack of control would render me powerless. Of course, that wouldn't protect me from blame. Blame would be applied retroactively for making the decision to host on Heroku in the first place (back when I had control).

With AWS, I have a little more control. I can design an architecture that spans different availability zones and regions. I can even create a mirror site on another hosting provider. But that means more effort and money. If I wanted even more control, I could host the sites on our own servers in some colocation facility. But every step towards full control doesn't necessarily reduce risk. It just shifts it. When I have more control, the sources of risk are my own failures in design and planning. And in my case, I just don't trust my knowledge of network and server administration to be better than the specialists. Like with any decision, you need to find a balance: you need enough control to fulfill your responsibilities but not so much that you exceed your capability to handle it.

posted at 09:29 · Infrastructure cloud DevOps PaaS
