Friday, April 29, 2011

A Storm Warning with Cloud Computing

Rick Blaisdell, CTO, reflects on the Amazon Cloud outage and how it relates to ConnectEDU’s Cloud preparation.

April 29, 2011 - As the CTO, I seem to get more attention when things go wrong than when they are going well. Simply performing is always the baseline requirement. Most of the people I interact with know I drink the Cloud Kool-Aid, so when the Amazon outages were reported, people came running from all over to ask whether I was nervous that ConnectEDU is also on a Cloud.

There are actually many Cloud providers in the market. We use NaviSite as our provider and run on the Cisco Unified Computing System (UCS). NaviSite is an enterprise provider, so they not only helped build our Cloud, they also manage and monitor it. This means we have security experts, maintenance experts and a 24/7 monitoring facility to ensure the system is secure, scalable and reliable.

As a standard, virtualization platforms (Clouds) have built-in failover mechanisms, so when a blade (computer) fails, the virtual machines that were running on that blade are automatically moved to another blade. These types of failures happen more often than most people would expect, and when they do, whichever virtualization platform is in use takes care of the issue automatically. Depending on how the system is set up, the user base may never experience an outage. This is how it's supposed to work. However, in extreme circumstances, if a company hasn't put enough backup capacity in place, a major failover can cause the system to go into a panic (yes, that's the technical term). This is when things get really ugly.
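To make the failover idea concrete, here is a minimal sketch in Python. It is a toy model, not how NaviSite, Cisco UCS or any real virtualization platform works: the `Blade` class, the `fail_over` function and the "most free slots" placement rule are all hypothetical names and choices made up for illustration. It shows the two outcomes described above: with spare capacity every virtual machine finds a new home, and without it some are left stranded.

```python
from dataclasses import dataclass, field


@dataclass
class Blade:
    """A hypothetical blade that can host a fixed number of virtual machines."""
    name: str
    capacity: int
    vms: list = field(default_factory=list)
    healthy: bool = True

    def free_slots(self) -> int:
        # A failed blade offers no capacity at all.
        return self.capacity - len(self.vms) if self.healthy else 0


def fail_over(blades, failed_name):
    """Mark one blade as failed and move its VMs to the remaining blades.

    Returns the list of VMs that could not be placed anywhere.
    """
    failed = next(b for b in blades if b.name == failed_name)
    failed.healthy = False
    stranded = []
    for vm in failed.vms:
        # Assumed placement policy: pick the blade with the most free slots.
        target = max(blades, key=lambda b: b.free_slots())
        if target.free_slots() > 0:
            target.vms.append(vm)
        else:
            stranded.append(vm)
    failed.vms = []
    return stranded


# Enough spare capacity: nobody notices the failure.
roomy = [
    Blade("blade-1", 4, ["web", "db"]),
    Blade("blade-2", 4, ["app"]),
    Blade("blade-3", 4, ["cache"]),
]
print(fail_over(roomy, "blade-1"))   # prints []

# No spare capacity: the surviving blade is already full.
tight = [
    Blade("blade-a", 2, ["vm1", "vm2"]),
    Blade("blade-b", 2, ["vm3", "vm4"]),
]
print(fail_over(tight, "blade-a"))   # prints ['vm1', 'vm2']
```

The second case is the ugly one: the failure itself is routine, but there is nowhere left for the displaced workloads to go.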

So, what can a technology department do to prepare for such a disaster? If the companies that were affected by the Amazon outage had had an active live-site failover at another location, they would not have experienced a loss of service. This is not inexpensive, and everyone should weigh the risks and costs to decide how much redundancy, and what type, they need to provide the uptime expected. The Amazon incident should remind us of what can go wrong in a physical or virtualized environment, and luckily ConnectEDU has taken the appropriate precautions to avoid extreme downtimes.
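The risk-versus-cost trade-off can be roughed out with a little arithmetic. The sketch below is a back-of-the-envelope model only: the 99.9% figure is a made-up example rather than any provider's actual number, and the function names are hypothetical. It also assumes sites fail independently and that failover is instant, neither of which is fully true in practice, so treat the result as a best case.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(site_availability: float, sites: int) -> float:
    """Availability when the service stays up as long as any one site is up.

    Assumes site failures are independent of each other.
    """
    return 1 - (1 - site_availability) ** sites


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


single = 0.999  # hypothetical availability of one site
for sites in (1, 2):
    a = combined_availability(single, sites)
    print(f"{sites} site(s): {a:.6%} available, "
          f"about {expected_downtime_hours(a):.3f} hours down per year")
```

Under these assumptions, a second live site takes expected downtime from several hours a year to well under a minute. Whether that improvement is worth roughly doubling the infrastructure bill is the judgment each company has to make for itself.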

- Rick Blaisdell
  Chief Technology Officer

1 comment:

  1. If you read the 5000+ word post-mortem from AWS, you will see that it was human error that triggered the cascading failure of the EC2, RDS and EBS services in the US-EAST-1 region on April 21. Customers running EC2 instances and using EBS were knocked out as EBS ran out of resources during a massive re-mirroring of EBS volumes. AWS currently lacks the customer tools to make it possible to achieve EBS redundancy outside of an availability zone. Yes, you could run EC2 instances in both the East and West regions, but EBS is another matter. So I would look for AWS to re-architect availability zones for improved survivability and provide customers with tools so they can adequately plan for this type of failure going forward. And no, the cloud is not falling.
